Changelog

Unreleased

0.1.8 (2026-04-15)

Changed

  • Project downloads require confirmation: Downloads from project accessions (SRP/ERP/DRP/PRJNA/PRJEB/PRJDB) now always require --yes / -y to proceed, preventing surprise multi-hundred-GiB downloads. The info table is shown for all project downloads so users can review what they're about to download.
  • Lower size confirmation threshold: The size gate for non-project downloads was lowered from 500 GiB to 100 GiB.
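The two confirmation rules above can be sketched roughly as follows. This is an illustration only, not sracha's actual code: the function names are hypothetical, and treating the 100 GiB gate as inclusive (`>=`) is an assumption.

```rust
/// Size gate for non-project downloads, per the 0.1.8 entry (100 GiB).
const SIZE_GATE_BYTES: u64 = 100 * 1024 * 1024 * 1024;

/// Project accession prefixes listed in the 0.1.8 entry.
const PROJECT_PREFIXES: [&str; 6] = ["SRP", "ERP", "DRP", "PRJNA", "PRJEB", "PRJDB"];

/// Hypothetical helper: true if the accession looks like a project accession.
fn is_project_accession(accession: &str) -> bool {
    let upper = accession.trim().to_ascii_uppercase();
    PROJECT_PREFIXES.iter().any(|prefix| upper.starts_with(prefix))
}

/// Hypothetical helper: true if the download should stop and ask for --yes.
/// Assumption: the size gate is inclusive; the changelog does not say.
fn needs_confirmation(accession: &str, total_bytes: u64, yes: bool) -> bool {
    if yes {
        return false;
    }
    is_project_accession(accession) || total_bytes >= SIZE_GATE_BYTES
}
```

With this shape, a project accession is gated regardless of size, while a single run is gated only once it reaches the threshold.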

Added

  • Disk space check: Downloads now check available disk space in the target directory before starting and bail with a clear error if there isn't enough room.

0.1.7 (2026-04-15)

Fixed

  • PacBio sequence accuracy: Replaced quality-based N-masking with ALTREAD 4na ambiguity merge, matching the VDB schema's bit_or(2na, .ALTREAD) derivation. PacBio SRR38107137 drops from 680 to 0 sequence mismatches and 9,324 to 0 quality mismatches vs fasterq-dump. Illumina output remains byte-identical. Closes #4.
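A minimal sketch of what a bit_or-style ambiguity merge can look like. It assumes the common 4na bit layout (A=1, C=2, G=4, T=8) and a 2na code of 0..=3 for A, C, G, T. The function name and lookup table are illustrative and not taken from sracha or ncbi-vdb.

```rust
/// IUPAC letters indexed by 4na bitmask (assumed layout: A=1, C=2, G=4, T=8).
const IUPAC_4NA: &[u8; 16] = b"-ACMGRSVTWYHKDBN";

/// Merge a 2na base code with a 4na ALTREAD mask and return the IUPAC letter.
/// Hypothetical helper: widens the 2na code to a one-hot 4na mask, then ORs
/// in the alternate-read bits so ambiguous positions get an ambiguity code.
fn merge_base(base_2na: u8, altread_4na: u8) -> u8 {
    let primary = 1u8 << (base_2na & 0b11);
    let merged = primary | (altread_4na & 0x0F);
    IUPAC_4NA[merged as usize]
}
```

An ALTREAD mask of 0 leaves the base unchanged, while a mask with all four bits set yields N.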

0.1.6 (2026-04-15)

Added

  • Dev version strings: Non-release builds now show git SHA and dirty flag (e.g. 0.1.6-dev+abc1234.dirty) via a build script.
  • cSRA rejection: Detect aligned SRA (cSRA) archives and return an actionable error pointing users to fasterq-dump.
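The dev version string format shown above (0.1.6-dev+abc1234.dirty) could be assembled along these lines. This is a sketch only: the function name is hypothetical, the 7-character SHA truncation is an assumption based on the example, and the real build script may obtain and format these values differently.

```rust
/// Build a version string in the style of the example from the 0.1.6 entry.
/// `sha` is None for release builds, which keep the plain base version.
fn dev_version(base: &str, sha: Option<&str>, dirty: bool) -> String {
    match sha {
        None => base.to_string(),
        Some(sha) => {
            // Assumption: the SHA is shortened to 7 characters.
            let short: String = sha.chars().take(7).collect();
            let suffix = if dirty { ".dirty" } else { "" };
            format!("{base}-dev+{short}{suffix}")
        }
    }
}
```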

Changed

  • Benchmarks: Updated README benchmarks to 8-core results with v0.1.5.
  • Integration tests: Switched from LS454 fixture (SRR000001) to Illumina (SRR28588231) after adding legacy platform rejection.

Fixed

  • Clippy: Fixed collapsible-if and manual-contains warnings from Rust 1.94.
  • PacBio quality decode: Expanded page map data_runs for variable-length rows.

0.1.5 (2026-04-14)

Added

  • Benchmarks: Added validation/benchmark.sh script comparing sracha against fastq-dump and fasterq-dump, and added benchmark results to README.
  • Graceful Ctrl-C handling: The get command now cancels in-flight downloads cleanly on SIGINT.

Changed

  • Progress bars: Switched to Unicode thin-bar style and extracted shared progress bar helper.
  • MIT license: Added LICENSE file.

Fixed

  • Cursor tests: Fixed temp file name collision in parallel cursor tests.

0.1.4 (2026-04-14)

Performance

  • Gzip backpressure: ParGzWriter now blocks when too many blocks are pending, preventing the decode loop from outrunning compression. Eliminates a multi-second finish() stall and reduces overall decode+gzip time by ~47% (19s to 10s on SRR000001).
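The backpressure idea can be illustrated with a bounded channel from the standard library. This is not ParGzWriter's implementation, only a sketch of the general technique: the producer blocks once `max_pending` blocks are queued, so it cannot outrun the consumer. The function name is hypothetical, and the consumer here just counts bytes instead of compressing.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Push blocks through a bounded queue and return the total bytes consumed.
fn run_pipeline(blocks: Vec<Vec<u8>>, max_pending: usize) -> usize {
    // Bounded channel: send() blocks while max_pending blocks are waiting.
    let (tx, rx) = sync_channel::<Vec<u8>>(max_pending);

    // Stand-in for the compression stage.
    let consumer = thread::spawn(move || {
        let mut total = 0usize;
        for block in rx {
            total += block.len();
        }
        total
    });

    // Stand-in for the decode loop.
    for block in blocks {
        tx.send(block).expect("consumer hung up");
    }
    drop(tx); // Close the channel so the consumer loop ends.

    consumer.join().expect("consumer panicked")
}
```

Because the queue never grows past the bound, there is no large backlog left to drain at the end, which is the kind of finish() stall the entry describes.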

0.1.3 (2026-04-14)

Performance

  • Thread-local compressor reuse: Gzip compression reuses libdeflater Compressor and output buffer across blocks via thread-local storage, avoiding ~300 KiB malloc/free per 256 KiB block.
  • Cap gzip thread pool: Compression pool threads are now capped at available_parallelism() to prevent oversubscription.
  • Lazy quality fallback buffer: The lite quality buffer is only allocated when quality data is actually missing, skipping ~300 KiB per blob in the common case.
  • Inline izip type 0 reads: Eliminated intermediate Vec<i64> allocations in izip decode by reading packed values directly from raw buffers during output reconstruction.
  • Zero-copy blob data: DecodedBlob now borrows data directly from mmap'd slices via Cow<'a, [u8]>, eliminating ~9% of heap allocations.
  • Multi-accession download prefetch: When processing multiple accessions, the next file's download starts while the current one is being decoded, overlapping network and CPU.
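As an illustration of the zero-copy item above, the following sketch shows how `Cow<'a, [u8]>` lets a decoder borrow bytes when no transformation is needed and allocate only when it is. `BlobData` is a simplified stand-in, not the real DecodedBlob, and the bit-flip transformation is a placeholder.

```rust
use std::borrow::Cow;

/// Simplified stand-in for a decoded blob; the real type's fields are not
/// shown in the changelog.
struct BlobData<'a> {
    data: Cow<'a, [u8]>,
}

/// Borrow the input when it can be used as-is; allocate only when the bytes
/// must be transformed.
fn decode<'a>(raw: &'a [u8], needs_transform: bool) -> BlobData<'a> {
    if needs_transform {
        // Placeholder transformation: flip every bit into a new buffer.
        BlobData { data: Cow::Owned(raw.iter().map(|b| !b).collect()) }
    } else {
        // No copy: the blob points straight at the (e.g. mmap'd) input.
        BlobData { data: Cow::Borrowed(raw) }
    }
}
```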

Changed

  • Added profiling cargo profile (optimized, no LTO) for heap profiling with valgrind/dhat.

Fixed

  • Illumina tile boundaries: Fixed skey id2ord delta unpacking to use big-endian bitstream order matching ncbi-vdb's Unpack function. Tile assignments at spot boundaries are now correct. Also fixed span_bits header offset for v2 index files. Closes #3.
  • Per-spot template selection: Name templates are now looked up per spot (not per blob), so tile transitions within a blob produce correct deflines.
  • Fixed spot length for v1 blobs: When READ_LEN is absent, the v1 blob header row_length is now used as a fallback for fixed spot length detection, enabling correct spot splitting without API access.
  • irzip v3 dual-series decoding: Implemented the series_count=2 path for irzip decompression, fixing X/Y coordinate decoding for blobs that use interleaved dual-series delta encoding.
  • X/Y page map expansion: X and Y column values are now expanded via page map data runs, matching the existing READ_LEN expansion logic.
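For readers unfamiliar with big-endian bitstream order, as mentioned in the tile boundaries fix, here is a generic sketch of unpacking fixed-width values most-significant-bit first. It is not the ncbi-vdb Unpack function or sracha's decoder. The function name is hypothetical, and it panics if `data` is too short.

```rust
/// Read `count` unsigned values of `bits` bits each from `data`, taking bits
/// MSB-first within each byte (big-endian bitstream order).
fn unpack_be(data: &[u8], bits: u32, count: usize) -> Vec<u64> {
    assert!(bits >= 1 && bits <= 64);
    let mut out = Vec::with_capacity(count);
    let mut bit_pos = 0usize;
    for _ in 0..count {
        let mut value = 0u64;
        for _ in 0..bits {
            let byte = data[bit_pos / 8];
            // Bit 7 of each byte comes first in the stream.
            let bit = (byte >> (7 - (bit_pos % 8))) & 1;
            value = (value << 1) | u64::from(bit);
            bit_pos += 1;
        }
        out.push(value);
    }
    out
}
```

Reading the same bytes in little-endian bit order would give different values whenever a field straddles a byte boundary, which is the class of bug the entry describes.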

0.1.2 (2026-04-14)

Added

  • Direct S3 fetch: Downloads now probe the NCBI SRA Open Data S3 bucket directly, skipping the SDL API round-trip. Falls back to SDL automatically when the direct URL is unavailable (old/non-public accessions). Stable URLs also improve resume reliability vs. expiring presigned SDL URLs. Use --prefer-sdl to opt out.

Changed

  • Simplify KAR/VDB parsing: Unified duplicated PBSTree parsers across kar.rs and metadata.rs into a single shared implementation. Removed dead code (unused metadata children parsing, leftover debug logging), eliminated unnecessary temporary allocations in idx2 block decoding, and moved test-only functions (unpack, read_blob_for_row) behind #[cfg(test)]. Net reduction of ~220 lines with identical output.
  • Batch API calls for info and get: Multi-accession and project queries now resolve all runs in 2 HTTP requests (1 SDL + 1 EUtils) instead of 2N sequential calls. Significantly faster for projects with many runs.
  • Improved error messages: Not-found accessions now include an NCBI search link to help verify the accession exists.

0.1.1 (2026-04-13)

Added

  • FASTA output mode: --fasta flag on fastq and get commands outputs >defline\nsequence\n records instead of FASTQ. Skips quality column decode entirely for faster conversion when quality scores are not needed.
  • zstd compression: --zstd flag on fastq and get commands uses zstd compression instead of gzip. Native multi-threaded compression via the zstd crate. Configurable level with --zstd-level (1-22, default 3). Produces .fastq.zst or .fasta.zst output files.
  • validate subcommand: sracha validate <file.sra> verifies SRA file integrity by opening the KAR archive, parsing the SEQUENCE table, and decoding all blobs in parallel without producing output. Reports columns found, spot/blob counts, and any decode errors. Exits with code 1 on failure.
  • Resume interrupted downloads: Downloads now resume automatically. Completed files are skipped (verified by size + MD5). Parallel chunked downloads track progress in a .sracha-progress sidecar file; on retry, only incomplete chunks are re-downloaded. Single-stream downloads resume via HTTP Range. Use --no-resume to force a fresh download.

Changed

  • Compression is now configured via a CompressionMode enum (None, Gzip, Zstd) instead of separate --gzip / --no-gzip boolean flags. Existing flag behavior is preserved: gzip is the default, --no-gzip disables compression, --zstd selects zstd.
  • sracha get temp downloads now preserve partial files on failure for automatic resume on the next attempt.
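The flag-to-enum mapping described above might look like the sketch below. The variant names come from the changelog entry, but the function name is hypothetical, and letting --zstd take precedence when both --zstd and --no-gzip are passed is an assumption.

```rust
/// Compression modes named in the 0.1.1 entry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CompressionMode {
    None,
    Gzip,
    Zstd,
}

/// Resolve CLI flags to a mode: gzip by default, --no-gzip disables
/// compression, --zstd selects zstd.
/// Assumption: --zstd wins if both flags are given.
fn resolve_mode(no_gzip: bool, zstd: bool) -> CompressionMode {
    if zstd {
        CompressionMode::Zstd
    } else if no_gzip {
        CompressionMode::None
    } else {
        CompressionMode::Gzip
    }
}
```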

0.1.0 (2026-04-13)

Added

  • Project-level accessions: sracha get PRJNA675068 and sracha get SRP123456 resolve study/BioProject accessions to constituent runs via NCBI EUtils API.
  • Accession list input: --accession-list flag on get, fetch, and info reads accessions from a file (one per line, # comments supported).
  • Illumina name reconstruction: Deflines now include the original Illumina read name (instrument:run:flowcell:lane:tile:X:Y) reconstructed from the skey index and physical X/Y columns.
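A minimal sketch of parsing an accession list file in the format described above (one accession per line, # comments). The function name is hypothetical, and stripping trailing inline comments is an assumption; the changelog only says # comments are supported.

```rust
/// Parse accession list text: one accession per line, blank lines skipped.
/// Assumption: everything after '#' on a line is treated as a comment.
fn parse_accession_list(text: &str) -> Vec<String> {
    text.lines()
        .map(|line| line.split('#').next().unwrap_or("").trim())
        .filter(|entry| !entry.is_empty())
        .map(str::to_string)
        .collect()
}
```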

Fixed

  • Quality string corruption: Fixed three bugs that could produce invalid FASTQ quality strings, causing STAR alignment failures:
      ◦ ASCII quality heuristic now validates all bytes, not just the first 100.
      ◦ Quality offset tracking always advances in the fallback path.
      ◦ format_read validates quality length matches sequence and sanitizes invalid bytes (outside Phred+33 range [33, 126]).
  • N base handling: Bases with quality <= Phred 2 are now emitted as N, matching the NCBI convention for Illumina no-call bases in 2na encoding.
  • Defline format: Output now matches fasterq-dump format (@RUN.SPOT_NUM DESCRIPTION length=LEN) with the + line repeating the full defline.
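The defline layout and quality checks described above can be sketched as follows. This is not sracha's format_read. The function name and signature are hypothetical, and replacing out-of-range quality bytes with '!' (Phred 0) is an assumption, since the changelog does not say what sanitizing substitutes.

```rust
/// Format one FASTQ record as `@RUN.SPOT_NUM DESCRIPTION length=LEN`, with
/// the `+` line repeating the full defline.
fn format_fastq_record(
    run: &str,
    spot: u64,
    description: &str,
    seq: &[u8],
    qual: &[u8],
) -> Result<String, String> {
    // Quality length must match sequence length.
    if seq.len() != qual.len() {
        return Err(format!(
            "quality length {} does not match sequence length {}",
            qual.len(),
            seq.len()
        ));
    }
    let defline = format!("{run}.{spot} {description} length={}", seq.len());
    // Assumption: bytes outside Phred+33 [33, 126] are replaced with '!'.
    let qual: String = qual
        .iter()
        .map(|&q| if (33..=126).contains(&q) { q as char } else { '!' })
        .collect();
    let seq = String::from_utf8_lossy(seq);
    Ok(format!("@{defline}\n{seq}\n+{defline}\n{qual}\n"))
}
```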