by kxc42 1256 days ago
I'm not sure why gzip still pops up for FASTQ data, as it is quite easy to bin the quality scores, align the reads against a reference genome and compress the result as e.g. CRAM [1,2].

With 8 bins, the variant calling accuracy seems to be preserved, while drastically reducing the file size.
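
For anyone who hasn't seen quality binning, here is a minimal Python sketch of the idea. The bin boundaries and representative values are an assumption for illustration (loosely modeled on commonly cited 8-level schemes), not something taken from the linked posts, and `bin_phred` / `bin_quality_string` are hypothetical helper names.

```python
import bisect

# Assumed, illustrative 8-level scheme: each bin starts at _BIN_STARTS[i]
# and every score in that bin is replaced by _BIN_VALUES[i].
_BIN_STARTS = [0, 2, 10, 20, 25, 30, 35, 40]
_BIN_VALUES = [0, 6, 15, 22, 27, 33, 37, 40]


def bin_phred(q: int) -> int:
    """Map a Phred score to the representative value of its bin."""
    if q < 0:
        raise ValueError("Phred scores must be non-negative")
    idx = bisect.bisect_right(_BIN_STARTS, q) - 1
    return _BIN_VALUES[idx]


def bin_quality_string(qual: str, offset: int = 33) -> str:
    """Bin a FASTQ quality string (Phred+33 encoding by default)."""
    return "".join(chr(bin_phred(ord(c) - offset) + offset) for c in qual)


# 'I' is Q40, '5' is Q20, '+' is Q10, '#' is Q2 in Phred+33.
print(bin_quality_string("II5+#"))
```

After binning, the quality line uses at most 8 distinct symbols, which is what makes it so much easier for a downstream compressor to shrink.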

[1]: https://en.wikipedia.org/wiki/CRAM_%28file_format%29

[2]: https://lh3.github.io/2020/05/25/format-quality-binning-and-...

2 comments

You don't necessarily have a reference genome to align to. For example, I've recently been working with wastewater metagenomics where (a) the sample consists of a very large number of organisms and (b) we don't have reference genomes for most of these organisms anyway.
That can be a challenge, but you can also build an "artificial" reference genome. You just use it for compression, not for any real analyses. This would allow you to still use alignment-based compression.

But I agree with you: it really depends on the type of the data.

It would also be nice if the artificial reference represented global population structure. For example, the larger the genetic distance between the individual being sequenced and the reference (an amalgam of several individuals from a common US population), the less compression you get. Instead, it seems like you could create the "genome that is the shortest distance to all other genomes" (a centroid of cluster centroids), and then the standard deviation of your compressed sizes should be much smaller.
Well, I think the issue with wastewater and other screening tech is that there is no global average reference genome. In that case they're sequencing everything from phages, viruses (human and plant), bacteria, fungi, plants/animals and human... it's an everything soup.
Oh. From what I can tell, the total world storage for non-human genome data is trivially small (a few petabytes and not growing rapidly). Human is huge: O(petabytes)/year for a single org is not out of the question.
That's true, but we do tremendous amounts of human DNA sequencing for certain causes at scale (e.g. understanding/treating cancer), whereas environmental sequencing is usually done to monitor/search for things at a much lower sample rate (e.g. disease load in wastewater, biodiversity from environmental samples, and looking for natural products produced by the zillions of bacteria/archaea in the oceans). From e.g. a wastewater sample perspective, the latter type is going to be the majority of data; we just filter out the stuff of interest and analyze it in situ. But there's no reason to store 1B E. coli genomes, whereas this is necessary if we want to understand cancer evolution.
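
A toy Python sketch of the "shortest distance to all other genomes" idea, under heavy assumptions: sequences are already aligned and of equal length, distance is plain Hamming distance, and the reference is picked from the existing sequences (a medoid) rather than constructed as a true centroid of cluster centroids. `hamming` and `medoid` are hypothetical helper names.

```python
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def medoid(seqs: Sequence[str]) -> str:
    """Return the sequence with the smallest total distance to all others."""
    return min(seqs, key=lambda s: sum(hamming(s, t) for t in seqs))


# Toy "population": ACGA is one mismatch away from every other sequence.
population = ["ACGT", "ACGA", "ACCA", "TCGA"]
print(medoid(population))  # ACGA
```

With a reference chosen this way, no individual is far from it, so the per-sample difference lists (and hence compressed sizes) should vary less.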
If you want to use untargeted metagenomics to detect novel human viruses you're going to be generating petabytes all by yourself: https://arxiv.org/pdf/2108.02678.pdf
It might be because some popular bioinformatics tools support using gzipped data directly.
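
To illustrate how low the barrier is, here is a minimal sketch of reading a gzipped FASTQ file with only the Python standard library. It assumes the simple four-lines-per-record layout (no wrapped sequence lines), and `read_fastq_gz` is a hypothetical helper name, not the API of any particular tool.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq_gz(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a gzipped 4-line FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            seq = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header[1:], seq, qual  # drop the leading '@'


# Usage (the path is a placeholder):
# for read_id, seq, qual in read_fastq_gz("sample.fastq.gz"):
#     print(read_id, len(seq))
```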