Large --accession-list runs no longer fail with HTTP 414 URI Too Long
(#64). The SDL locate request is now split into chunks of 100 accessions
instead of packing every accession into one URL, and EUtils RunInfo EFetch
uses HTTP POST (no URL-length limit) instead of GET.
0.3.8 (2026-06-03)
Performance
Sharply lower peak memory during FASTQ decode of runs with many blobs
(#54, #55). The decode→write pipeline buffered formatted FASTQ in
fixed 1024-blob batches with a 4-deep hand-off queue, so a large
full-quality run could hold tens of GiB of decoded output before the
writer drained any of it — SRR36401016 used 19.4 GiB even at
-t 1 --connections 1. The buffer is now bounded by a thread-scaled
batch size ((threads × 8).clamp(64, 256)) with a single queued batch,
making peak decode RSS roughly independent of run size. The reporter
measured 19.4 GiB → 1.1 GiB at -t 1, with a small wall-clock
improvement; output is byte-identical.
Improvements
Better long-read (PacBio / Oxford Nanopore) support. Platform detection
now reads the authoritative col/PLATFORM/row numeric id
(INSDC:SRA:platform_id) before sniffing the schema table name, so runs
submitted as plain FASTQ and loaded under the generic
NCBI:SRA:GenericFastq schema — common for PacBio/ONT — report their real
platform (e.g. PACBIO_SMRT) instead of unknown. The same id feeds the
read-structure fallback, so generic-loaded long-read runs resolve to one
biological read per spot even without read columns. Schema-based
read-structure inference also resolves PacBio and Nanopore schema-tagged
runs to one biological read per spot instead of erroring out to an untyped
fallback, so single-end long-read spots decode with a known read type. The
single-end advisory printed for --split split-3 now also fires for
--split interleaved and points at --split split-spot as the explicit
single-file layout. --seq-defline's $sn is documented as the
platform-native read identifier for PacBio/ONT (e.g. m64012_.../ccs,
ONT <uuid> ch=.. start_time=..); the channel/start-time/ZMW values
long-read platforms embed there are substrings of that name rather than
separate columns. Adds network-gated integration fixtures for a PacBio
SMRT run (SRR38889541) and an Oxford Nanopore run (SRR38892122).
PacBio/Oxford Nanopore CONSENSUS (CCS) support. When a database carries
a CONSENSUS table, sracha now reads its reads from there by default —
mirroring fasterq-dump's insp_db_type() table selection — so FASTQ output
matches fasterq-dump byte-for-byte (verified on DRR032988: 4,004 reads,
identical bases/quality/deflines, with empty consensus rows dropped). The
default defline now follows fasterq-dump's dflt_seq_defline rule: the name
field is emitted for tables that carry spot names (the reconstructed name, or
the spot number as the synthesized fallback) and omitted entirely for
tables with no NAME column (CONSENSUS), instead of repeating the spot number.
Fixes
Decode older PacBio SMRT archives (e.g. DRR032988) that previously failed
with page_map: data_runs has N entries, expected at least M. Variable-length
array columns (READ_START, READ_TYPE, LABEL_LEN/START, RD_FILTER) pack
per-record arrays of differing lengths; the page-map expansion now derives
each physical record's width from lengths/leng_runs instead of assuming a
single fixed row length, so records are replicated to rows correctly.
Stop mis-decoding raw, uncompressed 2na READ payloads. A header-less READ
blob whose bytes happen to parse as a tiny deflate stream (PacBio CONSENSUS
READ) is now recognised as raw when its size matches the expected packed base
count, rather than being collapsed to a few bytes.
Reconstruct native long-read names for PacBio Revio and Oxford Nanopore
GenericFastq (sharq-loaded) runs, which store the read id
(<movie>/<zmw>/ccs, or an ONT UUID) entirely in the skey text index with
no physical NAME column. Two parts: (1) the PBSTree data_idx stride now uses
raw-byte width thresholds (≤256→u8, ≤65536→u16) instead of trans_off's
×4-scaled thresholds, which silently corrupted any node-data region in the
(256, 1024]-byte window (e.g. a Revio skey transition with a 1013-byte
payload); (2) the dense (one-key-per-row) text-index projection
(KPTrieIndex_v2 variant 0: [count][ord2node]) is now decoded to map each
spot to its trie node, so deflines carry the native name and match
fasterq-dump byte-for-byte (SRR38889541, SRR38892122) instead of falling back
to the spot number.
0.3.7 (2026-05-29)
Features
sracha {get,fastq} --seq-defline <TEMPLATE> sets a custom FASTQ/FASTA
defline using fasterq-dump's --seq-defline syntax (#50). Supports
$ac (accession), $si (spot id), $ri (read id), $sn (spot name),
$rl (read length), and $$ for a literal $; the + line mirrors
the template. Templates are validated at startup. $sg (spot-group) is
not supported. Without the flag, output is unchanged. Adds a "Coming
from sra-tools" option-mapping table to the CLI docs.
0.3.6 (2026-05-16)
Features
sracha get --metadata {tsv,json,both} writes a
{accession}.metadata.{tsv,json} sidecar alongside the FASTQ outputs
after a successful decode (#37). Captures BioSample/SAMN, Sample/SRS,
BioProject, library strategy/source/selection/layout, instrument
model, experiment, study, scientific name, tax id, base count, and
release/load dates from the EUtils RunInfo CSV. RunInfo gains 17
optional fields and now derives Default.
sracha get --dry-run resolves accessions and prints what would be
downloaded as TSV (default) or JSON via --dry-run-format, then
exits without downloading or decoding (#38). Honors --prefer-sdl,
--no-runinfo, --prefer-ena, and project/study expansion.
sracha {get,fastq} --paired-suffix {numeric,r} selects between
_1/_2 (default, matches fasterq-dump and ENA filenames) and
_R1/_R2 FASTQ filenames for paired/split outputs (#39). Matches
the Illumina BCL convention many pipelines expect; applies uniformly
to VDB decode, cSRA decode, split-files, and the ENA fast path.
sracha {get,fastq} --folder-per-accession places each accession's
outputs — FASTQ files, metadata sidecar, completion marker, temp
SRA, .sracha-progress sidecar, and any --keep-sra artifact —
inside its own <output_dir>/<accession>/ subdirectory (#40). The
shared sracha-stats.jsonl audit log stays at the top level so it
aggregates across runs.
0.3.5 (2026-04-26)
Fixes
Honor idx0_count from the v3+ column header so columns whose idx0
carries trailing bytes past the last valid BlobLoc parse cleanly
(#32). Unblocks SRR15000000 and similar newer-writer archives.
Decode variant-2 random-access ALTREAD page maps via
data_offset[row_count] + per-run lengths, with write-time row
dedup (#33). DRR024182 reaches byte-identical R1/R2 vs fasterq-dump.
Walk persisted PTrie nodes to reconstruct full skey templates (the
offset-table fast path only saw leaf suffixes), and drop the ALTREAD
gate from Illumina X/Y detection (#35). DRR016241, DRR032228,
DRR032250, DRR041584, DRR041585, DRR048907 reach PASS_MD5.
Features
sracha get --head-concurrency <N> (default 64) tunes the S3
HEAD-probe fan-out used during accession resolution (#34). Bumps
the built-in pool/probe defaults from 16 → 64.
0.3.4 (2026-04-25)
Fixes
Bound header-driven allocations to prevent SIGABRT on SRA-Lite
quality blobs (#30). All 8 flagged accessions in PRJNA542889
decode under ulimit -v 4000000.
Decode random-access variant-2 page maps by reading the trailing
data_offset[row_count] overlay into data_runs. 6 PASS_CONTENT →
PASS_MD5 in the 100-accession corpus (DRR040793, DRR050206,
DRR036255, DRR036514, DRR040777, DRR041132).
Align READ_LEN with READ by row id rather than blob index. Fixes
truncation on archives where the two columns have mismatched blob
counts; DRR023226 and DRR023232 go from FAIL_COUNT to PASS_MD5.
Read skey templates directly from the offset-indexed string table
and loosen projection-count matching, replacing the byte-scan +
dedup heuristics. DRR035881 and DRR026998 reach PASS_MD5.
Support skey on flat-table archives (DRR019046) and trim adjacent-
template prefix bytes that the backward $X walk swept into the
next template (DRR053011). ~44 PASS_CONTENT → PASS_MD5 in the
random corpus.
Treat ALTREAD raw-passthrough zip blobs (no ops/args, header
osize == on-disk size) as data instead of failing decode. Fixes
DRR019046's lost trailing-N annotations.
Features
NAME_FMT column support: per-spot template overrides reproduce
fine-grained tile interleave on HiSeq archives (DRR040793-class)
that the skey range mapping can't capture. DRR002715 and DRR021982
newly byte-identical.
Emit /N mate suffix in interleaved and split-spot output for
fasterq-dump byte parity in single-stream mode. Split-3 /
split-files paths unchanged.
--stream mode for validation/random_corpus.sh: pipe both
decoders through md5sum instead of writing FASTQs to disk.
4.2× faster (13.6k → 3.3k s on the 100-accession corpus).
0.3.3 (2026-04-24)
Fixes
ALTREAD variable-row padding for N-mask byte-identity:
apply_altread_merge was calling pad_trimmed_rows_fixed with a
uniform row_bases = actual_bases / read_id_range — the average row
length. On Illumina runs with adapter-trimmed reads (per-row base
counts differing by 10–200 bases) any stored record whose trimmed
size exceeded the average errored inside the fixed-pad helper, the
merge silently skipped, and ALTREAD's 4na N annotations leaked
through as raw 2na bases — the N_MASK_ONLY divergences the
mismatch-report harness (#26) captured on DRR035183, SRR33907345,
and every FAIL_SEQ accession reclassified after PR #24.
New PageMap::pad_trimmed_rows_variable takes per-logical-row
targets so each padded row matches its READ row's true width;
apply_altread_merge threads READ's page_map through and feeds its
expanded per-row widths in whenever ALTREAD and READ rows align
1:1. The old fixed path remains the fallback for mismatched-blob-
size layouts (DRR035866's 2:1 ALTREAD-blob case). Verified 100.0%
IDENTICAL on DRR035183 and SRR33907345 vs fasterq-dump 3.2.1
(previously 73.7% / 94.5% N_MASK_ONLY on DRR035183).
READ 2na data_runs expansion for variable-length rows (#22):
when a READ blob's page map has a non-empty data_runs run-length
table, consecutive stored rows with identical 2na bytes are written
once and replicated on read. The expansion path previously
short-circuited whenever lengths wasn't uniform, silently
dropping the duplicated row and producing a SpotCountMismatch
plus asymmetric paired output. SRR33907345 blob 46 is the in-tree
repro: 4,095 stored rows with variable 70–502-base lengths
covering 4,096 logical rows via one data_runs[i]=2 entry. The
decoder now delegates to PageMap::expand_variable_data_runs —
same path the QUALITY column already uses — which handles both
uniform and variable per-row lengths correctly. Covered by the
new variable_length_data_runs_spot_count regression test.
Refactors
CLI utilities moved to sracha-core: thousands and
format_bases live in sracha_core::util alongside format_size;
InfoEntry and the TSV/CSV writer moved into a new
sracha_core::info module with dedicated unit tests. The
tabled-rendered human sracha info table stays in the CLI crate.
Drops ~150 lines from sracha/src/main.rs.
Izip type-0 reconstruction readability: introduced
NbufStream in sracha-vdb::blob to bundle
(data, variant, min, name) so the reconstruction loop reads
naturally (stream.read(idx)?) and out-of-bounds errors identify
which buffer (length / outlier / dx / dy / a / diff / simple) was
truncated.
Documentation
docs/cli.md documents --prefer-ena on sracha get and
sracha fetch; docs/getting-started.md covers the ENA fast path,
strict-integrity default / --no-strict, cSRA decoding,
--prefetch-depth, and --keep-sra.
Removed the orphan docs/implementation.md page; cSRA notes live
in docs/internal/csra-format-notes.md for developers.
CLAUDE.md updated for the three-crate workspace; prior doc
described a two-crate layout and hid sracha-vdb.
0.3.2 (2026-04-24)
Fixes
iunzip raw-passthrough decode (#20): some v2 iunzip blobs — seen
on long-read ENA archives like ERR15141550 — carry osize ==
data.len() with no ops/args because the encoder skipped the
bit-plane + deflate step. decode_irzip_column now detects this
shape and returns the bytes verbatim instead of force-routing them
through irzip_decode with a default planes = 0xFF and failing
with "corrupt deflate stream". Verified byte-identical against
fasterq-dump --split-3 on ERR15141550 (MD5
a063af39f57e9a09edae697fc99674a1).
Writer-closure capture deadlock: when a decode blob returned
Err, the decode_and_write writer thread's early return left
batch_rx alive in the parent stack frame (borrow-capture), so the
decode loop deadlocked on a full batch_tx.send() instead of
propagating the error. Writer now takes batch_rx by move; the
error surfaces cleanly to the caller.
Decoder bounds hardening: nbuf_read, decode_types, and the
izip_decode segment reconstruction loop now return
Error::Format on out-of-bounds / misaligned buffers instead of
panicking a rayon worker.
KAR magic prefix probe on cached skip: download_file accepts
an optional expected_prefix; when the cached .sracha-tmp-*.sra
matches on size but SDL gave no MD5 (multipart upload), sracha now
verifies the first 8 bytes are NCBI.sra before skipping the
download. Closes a secondary path from #20 where a stale temp file
from a crashed prior run fed garbage into the decoder.
0.3.1 (2026-04-19)
Performance
pwrite download writer + read_timeout: per-chunk writer now sends
hyper pieces over a bounded mpsc to a single spawn_blocking task
doing positional write_all_at on a sync std::fs::File, avoiding
tens of thousands of blocking-pool round-trips per download. Added a
15 s read_timeout and 10 s connect_timeout to the reqwest client
so a single stalled TCP connection no longer sets the floor for the
whole parallel download; retry backoff tightened from 2 s/4 s to
250 ms/500 ms. Post-fix on compute18: baseline 10.2 s for 288 MiB,
slow runs capped at ~15 s (previously unbounded).
Benchmarks / docs
End-to-end benchmark stage: new e2e sbatch array index times
the full accession → FASTQ workflow (sracha get vs prefetch +
fasterq-dump vs prefetch + fastq-dump) on SRR28588231 and
SRR2584863.
pixi run install-sratools: pins the reference toolkit
(default sra-tools 3.4.1) into validation/sra-tools/;
benchmark.sh auto-discovers the newest installed version.
README refreshed against sra-tools 3.4.1 on the head node (stable
S3): 11.6× / 4.5× / 4.4× local decode; sracha get 2.9× faster than
prefetch + fasterq-dump on the small accession and 1.55× on the
288 MiB medium.
0.3.0 (2026-04-19)
Added
Broader sracha vdb dump column coverage: name-based heuristic
picks up per-row scalars (PLATFORM, NREADS, SPOT_FILTER,
SPOT_ID, TRIM_LEN, TRIM_START, CLIP_QUALITY_LEFT/RIGHT),
per-read arrays (LABEL_LEN, LABEL_START, POSITION, RD_FILTER),
and ASCII templates (CS_KEY, NAME_FMT) in addition to the
existing SEQUENCE columns. New U8Scalar / U32Scalar cell kinds
render scalars as single numbers instead of one-element arrays. A
hidden --raw flag bypasses type inference and hex-dumps every
column — useful for debugging layouts the heuristic doesn't
recognize. Closes #12.
Reference-compressed cSRA (aligned SRA) decode: archives with a
physical SEQUENCE/col/CMP_READ plus sibling PRIMARY_ALIGNMENT +
REFERENCE tables are now decoded in pure Rust —
NCBI:align:seq_restore_read and NCBI:align:align_restore_read
are both reimplemented (see vdb/restore.rs). sracha fastq on a
cSRA file produces output byte-identical to fasterq-dump
(validated against ncbi-vdb's VDB-3418.sra test fixture, 985
spots / ~36 Mbp in ~4 s release). Platform-agnostic; long-read and
short-read aligned archives both work. Split / compression / stdout
flags and parallel decode (-t N) all go through the existing FASTQ
writer.
vdbcache-aware cSRA reader: CsraCursor::open_any routes each
sub-cursor (AlignmentCursor, ReferenceCursor) to whichever archive
carries its table. sracha fetch downloads the .sra.vdbcache
sidecar alongside the main .sra whenever SDL advertises one.
Narrowed reject_if_csra: the iter-4 rule rejected any archive
with aligned schema + CMP_BASE_COUNT > 0 + no unaligned marker.
Those archives still carry a full physical READ column in practice
and decode cleanly through the plain VdbCursor path; validated on
9 of the 10 past-rejected archives from prior random-corpus runs
(DRR017176, DRR027259, DRR027597, DRR032355, DRR040407, DRR040559,
DRR041303, DRR045227, DRR045255, DRR045332).
validation/random_corpus.sh --aligned: targets WGS /
BAM-loaded accessions via the ENA portal, passed through to
sample_accessions.sh.
Actionable errors for known-unsupported cSRA shapes: external
refseq fetch (REFERENCE without embedded CMP_READ; SRR341578-class)
and fixed-length SEQUENCE without physical READ_LEN both surface
clear "decode with fasterq-dump for now" messages instead of opaque
column header (idx1) not found diagnostics.
Fixed
spots_before race across BATCH_SIZE=1024 boundaries: the decode
loop used to read spots_read atomically into per-batch cumulative
offsets, racing with the writer thread across the bounded channel.
Archives with > 1024 blobs (e.g. DRR045255) silently reset the FASTQ
defline spot number to 1 at the 1,048,577th spot. Now tracked
locally in the decode loop using blob metadata only.
page_map random-access offset unit: variable-length integer
columns with row_length > 1 sometimes carry u32-indexed data_runs
(rather than entry-indexed). Adaptive dispatch tries entry-index
first and falls back to u32-index when the max offset would overflow
the decoded buffer. Unblocks DRR045255's READ_LEN blob at row ~1 M.
0.2.0 (2026-04-18)
Added
MD5 verification by default: Downloads verify MD5 against SDL-reported
hashes, decoded blobs verify per-blob MD5 and CRC32, and spot counts are
cross-checked against RunInfo. Use fetch --no-validate to skip.
sracha validate --md5 <HASH> / --offline: Check a file against an
explicit MD5 or skip the SDL lookup for air-gapped use.
Local SRA files in sracha info: Pass a .sra file path (including
~/...) to print the table of contents, schema, and metadata without
hitting NCBI.
Resolution spinners: get, fetch, and info show progress while
resolving projects and accessions.
Changed
Silent decode corruption: CRC32/MD5 mismatches and truncated
variable-length columns now abort with an error instead of producing
partial rows.
Download resume hardening: Range requests validate Content-Range and
track expected MD5 in .sracha-progress, catching servers that ignore
ranges or files replaced mid-transfer.
Verbosity defaults: Default log level hides INFO; use -v for INFO,
-vv for DEBUG, -vvv for TRACE.
Fixed
CRC32 computation: Per-blob CRC32 validation used the standard
CRC-32/ISO-HDLC (crc32fast) and disagreed with the variant emitted by
ncbi-vdb (MSB-first polynomial 0x04C11DB7, init=0, no reflection, no
final XOR). Previously the mismatch was swallowed; now that it's an
error, decode would have spuriously rejected real SRA files. Replaced
with a conforming implementation.
Aligned SRA / cSRA hang: Extended cSRA rejection to cover the
bam-load-style variant — files with a physical SEQUENCE/col/READ
column but an NCBI:align:db:... schema that synthesizes
READ_LEN/READ_TYPE through ncbi-vdb's schema-aware virtual cursor
(e.g. SRR14724462). Without that cursor the decode fell through to
fixed-length heuristics and wedged the pipeline. The existing
CMP_READ/PRIMARY_ALIGNMENT path and the new schema-based path now
return one unified UnsupportedFormat error pointing to
fasterq-dump. A matching "Not yet supported" entry was added to the
docs.
0.1.10 (2026-04-16)
Added
Completion markers: get writes .sracha-done markers so a second
invocation with the same output skips re-download and re-decode.
--format sra|sralite: Select full SRA or SRA-lite encoding via the
SDL capability parameter.
Changed
CLI reorganization: Commands and flags grouped semantically under
help headings for clearer --help output.
Strict flag validation: Contradictory CLI flag combinations now error
out instead of silently picking one.
Fixed
Ctrl-C cleanup in stdout mode: Interrupting -Z streaming now
deletes the temp SRA file and prints the correct cancellation message.
Version string: Release builds between tags now include the git SHA.
fastq / get help text: Clarify accession wording in fastq
subcommand; mention -Z in get docs.
0.1.9 (2026-04-16)
Added
Stdout streaming: New -Z flag streams FASTQ output to stdout for
piping into downstream tools. (#7)
75 new tests: Unit and integration tests covering previously untested
modules.
Acknowledgments: Added acknowledgments for NCBI and SRA Toolkit team.
Alignment docs page: New documentation page covering alignment topics.
Changed
VDB metadata read structure: Read structure (count, lengths, platform)
is now derived from VDB table metadata, making the EUtils RunInfo fetch
optional and improving reliability for accessions with missing RunInfo.
Tabled output: info and validate commands now use tabled for
formatted table output.
Remove dead --format flag: Removed unused --format argument; wired
up --no-resume for the get command.
Fixed
Interleaved output routing: Fixed a bug in interleaved split mode
output routing and corrected the min_read_len test.
0.1.8 (2026-04-15)
Changed
Project downloads require confirmation: Downloads from project accessions
(SRP/ERP/DRP/PRJNA/PRJEB/PRJDB) now always require --yes / -y to proceed,
preventing surprise multi-hundred-GiB downloads. The info table is shown for
all project downloads so users can review what they're about to download.
Lower size confirmation threshold: The size gate for non-project downloads
was lowered from 500 GiB to 100 GiB.
Added
Disk space check: Downloads now check available disk space in the target
directory before starting and bail with a clear error if there isn't enough
room.
0.1.7 (2026-04-15)
Fixed
PacBio sequence accuracy: Replace quality-based N-masking with ALTREAD
4na ambiguity merge, matching the VDB schema's bit_or(2na, .ALTREAD)
derivation. PacBio SRR38107137 drops from 680 to 0 sequence mismatches and
9,324 to 0 quality mismatches vs fasterq-dump. Illumina output remains
byte-identical. Closes #4.
0.1.6 (2026-04-15)
Added
Dev version strings: Non-release builds now show git SHA and dirty flag
(e.g. 0.1.6-dev+abc1234.dirty) via a build script.
cSRA rejection: Detect aligned SRA (cSRA) archives and return an
actionable error pointing users to fasterq-dump.
Changed
Benchmarks: Updated README benchmarks to 8-core results with v0.1.5.
Integration tests: Switched from LS454 fixture (SRR000001) to Illumina
(SRR28588231) after adding legacy platform rejection.
Fixed
Clippy: Fixed collapsible-if and manual-contains warnings from Rust 1.94.
PacBio quality decode: Expand page map data_runs for variable-length rows.
0.1.5 (2026-04-14)
Added
Benchmarks: Added validation/benchmark.sh script comparing sracha
against fastq-dump and fasterq-dump, and added benchmark results to README.
Graceful Ctrl-C handling: The get command now cancels in-flight
downloads cleanly on SIGINT.
Changed
Progress bars: Switched to Unicode thin-bar style and extracted shared
progress bar helper.
MIT license: Added LICENSE file.
Fixed
Cursor tests: Fixed temp file name collision in parallel cursor tests.
0.1.4 (2026-04-14)
Performance
Gzip backpressure: ParGzWriter now blocks when too many blocks are
pending, preventing the decode loop from outrunning compression. Eliminates
a multi-second finish() stall and reduces overall decode+gzip time by ~47%
(19s to 10s on SRR000001).
0.1.3 (2026-04-14)
Performance
Thread-local compressor reuse: Gzip compression reuses libdeflater
Compressor and output buffer across blocks via thread-local storage,
avoiding ~300 KiB malloc/free per 256 KiB block.
Cap gzip thread pool: Compression pool threads are now capped at
available_parallelism() to prevent oversubscription.
Lazy quality fallback buffer: The lite quality buffer is only allocated
when quality data is actually missing, skipping ~300 KiB per blob in the
common case.
Inline izip type 0 reads: Eliminated intermediate Vec<i64> allocations
in izip decode by reading packed values directly from raw buffers during
output reconstruction.
Zero-copy blob data: DecodedBlob now borrows data directly from
mmap'd slices via Cow<'a, [u8]>, eliminating ~9% of heap allocations.
Multi-accession download prefetch: When processing multiple accessions,
the next file's download starts while the current one is being decoded,
overlapping network and CPU.
Changed
Added profiling cargo profile (optimized, no LTO) for heap profiling
with valgrind/dhat.
Fixed
Illumina tile boundaries: Fixed skey id2ord delta unpacking to use
big-endian bitstream order matching ncbi-vdb's Unpack function. Tile
assignments at spot boundaries are now correct. Also fixed span_bits
header offset for v2 index files. Closes #3.
Per-spot template selection: Name templates are now looked up per spot
(not per blob), so tile transitions within a blob produce correct deflines.
Fixed spot length for v1 blobs: When READ_LEN is absent, the v1 blob
header row_length is now used as a fallback for fixed spot length detection,
enabling correct spot splitting without API access.
irzip v3 dual-series decoding: Implemented the series_count=2 path for
irzip decompression, fixing X/Y coordinate decoding for blobs that use
interleaved dual-series delta encoding.
X/Y page map expansion: X and Y column values are now expanded via
page map data runs, matching the existing READ_LEN expansion logic.
0.1.2 (2026-04-14)
Added
Direct S3 fetch: Downloads now probe the NCBI SRA Open Data S3 bucket
directly, skipping the SDL API round-trip. Falls back to SDL automatically
when the direct URL is unavailable (old/non-public accessions). Stable URLs
also improve resume reliability vs. expiring presigned SDL URLs. Use
--prefer-sdl to opt out.
Changed
Simplify KAR/VDB parsing: Unified duplicated PBSTree parsers across
kar.rs and metadata.rs into a single shared implementation. Removed dead
code (unused metadata children parsing, leftover debug logging), eliminated
unnecessary temporary allocations in idx2 block decoding, and moved test-only
functions (unpack, read_blob_for_row) behind #[cfg(test)]. Net reduction
of ~220 lines with identical output.
Batch API calls for info and get: Multi-accession and project queries
now resolve all runs in 2 HTTP requests (1 SDL + 1 EUtils) instead of 2N
sequential calls. Significantly faster for projects with many runs.
Improved error messages: Not-found accessions now include an NCBI search
link to help verify the accession exists.
0.1.1 (2026-04-13)
Added
FASTA output mode: --fasta flag on fastq and get commands outputs
>defline\nsequence\n records instead of FASTQ. Skips quality column decode
entirely for faster conversion when quality scores are not needed.
zstd compression: --zstd flag on fastq and get commands uses zstd
compression instead of gzip. Native multi-threaded compression via the zstd
crate. Configurable level with --zstd-level (1-22, default 3). Produces
.fastq.zst or .fasta.zst output files.
validate subcommand: sracha validate <file.sra> verifies SRA file
integrity by opening the KAR archive, parsing the SEQUENCE table, and
decoding all blobs in parallel without producing output. Reports columns
found, spot/blob counts, and any decode errors. Exits with code 1 on failure.
Resume interrupted downloads: Downloads now resume automatically.
Completed files are skipped (verified by size + MD5). Parallel chunked
downloads track progress in a .sracha-progress sidecar file; on retry,
only incomplete chunks are re-downloaded. Single-stream downloads resume
via HTTP Range. Use --no-resume to force a fresh download.
Changed
Compression is now configured via a CompressionMode enum (None, Gzip,
Zstd) instead of separate --gzip / --no-gzip boolean flags. Existing
flag behavior is preserved: gzip is the default, --no-gzip disables
compression, --zstd selects zstd.
sracha get temp downloads now preserve partial files on failure for
automatic resume on the next attempt.
0.1.0 (2026-04-13)
Added
Project-level accessions: sracha get PRJNA675068 and sracha get SRP123456
resolve study/BioProject accessions to constituent runs via NCBI EUtils API.
Accession list input: --accession-list flag on get, fetch, and info
reads accessions from a file (one per line, # comments supported).
Illumina name reconstruction: Deflines now include the original Illumina
read name (instrument:run:flowcell:lane:tile:X:Y) reconstructed from the
skey index and physical X/Y columns.
Fixed
Quality string corruption: Fixed three bugs that could produce invalid
FASTQ quality strings causing STAR alignment failures:
ASCII quality heuristic now validates all bytes, not just the first 100.
Quality offset tracking always advances in the fallback path.
format_read validates quality length matches sequence and sanitizes
invalid bytes (outside Phred+33 range [33, 126]).
N base handling: Bases with quality <= Phred 2 are now emitted as N,
matching the NCBI convention for Illumina no-call bases in 2na encoding.
Defline format: Output now matches fasterq-dump format
(@RUN.SPOT_NUM DESCRIPTION length=LEN) with the + line repeating the
full defline.