

by kxc42 1256 days ago

I'm not sure why gzip still pops up for FASTQ data, since it is quite easy to bin the quality scores, align the reads against a reference genome, and store the result as, e.g., CRAM [1,2].

With 8 bins, the variant calling accuracy seems to be preserved, while drastically reducing the file size.

[1]: https://en.wikipedia.org/wiki/CRAM_%28file_format%29

[2]: https://lh3.github.io/2020/05/25/format-quality-binning-and-...
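To make the binning step concrete, here is a minimal sketch of mapping Phred quality scores onto 8 representative values. The bin edges and representative values are illustrative assumptions, not taken from the linked post, and `bin_quality` is a hypothetical helper name. It assumes Phred+33 encoded quality strings.

```python
import bisect

# Assumed, illustrative bin layout: inclusive upper edges of the first 7 bins,
# and one representative Phred score per bin (8 bins in total).
BIN_UPPER = [1, 9, 19, 24, 29, 34, 39]
BIN_VALUE = [0, 6, 15, 22, 27, 33, 37, 40]


def bin_quality(qual: str, offset: int = 33) -> str:
    """Replace each quality character with its bin's representative score."""
    out = []
    for ch in qual:
        q = ord(ch) - offset                      # decode Phred score
        idx = bisect.bisect_left(BIN_UPPER, q)    # find which bin q falls into
        out.append(chr(BIN_VALUE[idx] + offset))  # re-encode representative
    return "".join(out)


binned = bin_quality("II5#!")
```

Because the binned string uses at most 8 distinct symbols, a downstream entropy coder has much less to encode per base than with the full score range.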

2 comments

jefftk 1256 days ago

You don't necessarily have a reference genome to align to. For example, I've recently been working with wastewater metagenomics where (a) the sample consists of a very large number of organisms and (b) we don't have reference genomes for most of these organisms anyway.

kxc42 1256 days ago

That can be a challenge, but you can also build an "artificial" reference genome. You just use it for compression, not for any real analyses. This would allow you to still use alignment-based compression.
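As a rough analogy for using a reference purely for compression, the sketch below primes zlib with a preset dictionary (`zdict`) that plays the role of the "artificial" reference. This is not how CRAM works internally; it only illustrates that a shared reference shrinks the payload. The random reference, the read, and the helper names are made up for illustration.

```python
import random
import zlib
from typing import Optional

# Toy "artificial reference": 2000 random bases (assumption for illustration).
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000)).encode()

# A 150-base read copied from the reference with one substituted base.
read_buf = bytearray(reference[500:650])
read_buf[75] = ord("A") if read_buf[75] != ord("A") else ord("C")
read = bytes(read_buf)


def compress(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Deflate data, optionally priming the compressor with a dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress(blob: bytes, zdict: bytes) -> bytes:
    """Inflate data that was compressed with the same dictionary."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


plain = compress(read)
with_ref = compress(read, reference)
print(len(plain), len(with_ref))
```

The read can only be recovered by someone holding the same reference, which is the same trade-off alignment-based formats make.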

But I agree with you: it really depends on the type of the data.

dekhn 1256 days ago

It would also be nice if the artificial reference represented global population structure. For example, the larger the genetic distance between the individual being sequenced and the person who makes up the reference (an amalgam of several individuals from a common US population), the less compression you get. Instead, it seems like you could create the "genome that is the shortest distance to all other genomes" (a centroid of cluster centroids), and then the standard deviation of your compressed sizes should be much smaller.
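A toy sketch of the "shortest distance to all other genomes" idea: pick the medoid, i.e. the input sequence with the smallest total distance to the rest. This is a simplification of the centroid-of-centroids suggestion, and it assumes equal-length, already-aligned sequences compared by Hamming distance. The sequences and function names are invented for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def medoid(genomes):
    """Return the genome with the smallest total distance to all others."""
    return min(genomes, key=lambda g: sum(hamming(g, other) for other in genomes))


# Made-up toy "genomes"; real data would need alignment and a better metric.
genomes = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TCGTACGA"]
reference = medoid(genomes)
print(reference)  # ACGTACGA
```

Here the chosen reference differs from every other sequence by a single base, so per-sample diffs (and hence compressed sizes) vary less than they would against an outlier.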

v8xi 1256 days ago

Well, I think the issue with wastewater and other screening tech is that there is no global average reference genome. In that case they're sequencing everything from phages and viruses (human and plant) to bacteria, fungi, plants/animals, and humans... it's an everything soup.

dekhn 1256 days ago

Oh. From what I can tell, the total world storage for non-human genome data is trivially small (a few petabytes and not growing rapidly). Human is huge: O(petabytes)/year for a single org is not out of the question.

v8xi 1256 days ago

That's true, but we do tremendous amounts of human DNA sequencing at scale for certain causes (e.g. understanding/treating cancer), whereas environmental sequencing is usually done to monitor or search for things at a much lower sample rate (e.g. disease load in wastewater, biodiversity from environmental samples, and looking for natural products produced by the zillions of bacteria/archaea in the oceans). From, e.g., a wastewater sample perspective, the latter type is going to be the majority of the data; we just filter out the stuff of interest and analyze it in situ. There's no reason to store 1B E. coli genomes, whereas this is necessary if we want to understand cancer evolution.

jefftk 1256 days ago

If you want to use untargeted metagenomics to detect novel human viruses you're going to be generating petabytes all by yourself: https://arxiv.org/pdf/2108.02678.pdf

asdff 1256 days ago

It might be because some popular bioinformatics tools support reading gzipped data directly.
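For illustration, reading gzipped FASTQ directly takes only a few lines with Python's standard `gzip` module, which is part of why the format is so convenient. The parser below is a minimal sketch that assumes strict 4-line records, and the two records are made-up sample data held in memory rather than read from a real file.

```python
import gzip
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a text handle of 4-line records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


# Made-up sample data, gzipped in memory to stand in for a .fastq.gz file.
raw = b"@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII5#\n"
buf = io.BytesIO(gzip.compress(raw))

# gzip.open accepts a path or an existing file object; "rt" decodes to text.
with gzip.open(buf, "rt") as fh:
    records = list(read_fastq(fh))
```

With a real file, the same parser would work on `gzip.open("reads.fastq.gz", "rt")`, where the filename is a placeholder.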
