Zstandard's --long Mode: A Genome Compression Breakthrough
2025-09-15
Zstandard's --long mode enables a long-range match finder that enlarges the search window, which can improve compression of large files. In a test on a 2.6 Tbp dataset of 661,405 bacterial genomes, default Zstandard reached a compression ratio of only 3, and enabling --long raised it modestly to 4. Removing newlines from the FASTA files, however, boosted the ratio to 31, shrinking the data to 80 GB and approaching the performance of specialized DNA compressors. Compression time increased slightly, but the gain is a valuable optimization for handling large genomic datasets.
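As a rough illustration of the newline-removal step, the sketch below joins the wrapped sequence lines of each FASTA record into a single line before the data is handed to the compressor. It assumes that "removing newlines" means unwrapping the sequence lines while keeping each header on its own line; the summary does not say exactly how the original test did it, and the function name unwrap_fasta is hypothetical.

```python
from typing import Iterable, Iterator


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each record's sequence joined onto one line.

    Assumption: header lines (starting with '>') are kept as-is, and only
    the line breaks inside the sequence are removed.
    """
    chunks = []  # sequence fragments of the record currently being read
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record's sequence, if any.
            if chunks:
                yield "".join(chunks) + "\n"
                chunks = []
            yield line + "\n"
        elif line:
            # Blank lines are skipped; everything else is sequence data.
            chunks.append(line)
    if chunks:
        # Flush the sequence of the final record.
        yield "".join(chunks) + "\n"
```

One way to use it would be to stream a file through this function and pipe the result into the zstd command line, for example with `zstd --long=31 -T0`. The window size and thread count here are assumptions, not the settings reported in the test. Archives written with a window log above 27 also need `--long=31` (or a raised memory limit) when decompressing.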
Tech
genome compression