by dekhn 1031 days ago
bzip2 is marginally better. Then genome-specific compressors were developed, and finally people started storing individual genomes as diffs from a single reference: https://en.wikipedia.org/wiki/CRAM_(file_format)

Since genome files contain more data than just ATGC (typically a comment line, then a DNA line, then a quality score line), and each of those draws from a different distribution, DEFLATE on a FASTQ file doesn't reach the full potential of the compressor: the Huffman table ends up having to cover all three distributions, and the LZ77 back-references aren't as efficient either. It turns out you can split the file into multiple streams, one per line type, and compress those independently for slightly better compression ratios, but it's still not great.
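A rough sketch of the stream-splitting idea, in case it helps. Everything here is an assumption for illustration: the helper names (make_fastq, split_streams, merge_streams, deflate_size) are made up, the input is synthetic random reads rather than real sequencing data, and records are assumed to be exactly four unwrapped lines each (header, bases, "+", qualities). Real FASTQ compressors do much more than this, so treat the numbers it prints as a toy comparison only.

```python
import random
import zlib


def make_fastq(n_reads, read_len=50, seed=0):
    """Build a synthetic 4-line-per-record FASTQ string (illustrative data only)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        # Skewed toy quality alphabet; real quality strings look different.
        qual = "".join(rng.choice("FFFF:,#") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records)


def split_streams(fastq_text):
    """Separate headers, bases, and qualities; the constant '+' line is dropped."""
    lines = fastq_text.splitlines()
    return {
        "headers": "\n".join(lines[0::4]),
        "bases": "\n".join(lines[1::4]),
        "quals": "\n".join(lines[3::4]),
    }


def merge_streams(streams):
    """Inverse of split_streams for non-empty input: rebuild the FASTQ text."""
    headers = streams["headers"].split("\n")
    bases = streams["bases"].split("\n")
    quals = streams["quals"].split("\n")
    return "".join(
        f"{h}\n{b}\n+\n{q}\n" for h, b, q in zip(headers, bases, quals)
    )


def deflate_size(text):
    """Size in bytes after zlib (DEFLATE) compression at the highest level."""
    return len(zlib.compress(text.encode("ascii"), 9))


fastq = make_fastq(2000)
streams = split_streams(fastq)

whole = deflate_size(fastq)
split = sum(deflate_size(s) for s in streams.values())

print(f"raw bytes:          {len(fastq)}")
print(f"one DEFLATE stream: {whole}")
print(f"three streams:      {split}")
```

Each stream gets its own Huffman tables and its own match window, which is the whole point of the split. How much that buys you depends heavily on the data, so try it on real reads before drawing conclusions.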