Hacker News new | ask | show | jobs
by kxc42 1257 days ago
That can be a challenge, but you can also build an "artificial" reference genome. You just use it for compression, not for any real analyses. This would allow you to still use alignment-based compression.

But I agree with you: it really depends on the type of the data.

1 comments

It would be nice also that the artificial reference represented global population structure- for example, the larger the genetic distance between an individual who is sequenced, and the identity of the person who makes up the reference (an amalgam of several individuals from a common US population), the less compression you get. Instead, it seems like you could create the "genome that is the shortest distance to all other genomes" (a centroid of cluster centroids) and then the standard deviation of your compressed sizes should be much smaller.
Well I think the issue with wastewater and other screening tech is that there is no global average reference genome. In that case they're sequencing everything from phages, viruses (human and plant), bacteria, fungi, plants/animals and human...its an everything soup.
oh. From what I can tell, the total world storage for non-human genome data is trivially small (a few petabytes and not growing rapidly). Human is huge- O(petabytes)/year for a single org is not out of the question.
Thats true, but we do tremendous amounts of human DNA sequencing for certain causes at scale(e.g. understanding/treating cancer) whereas environmental sequencing is usually done to monitor/search for things at a much lower sample rate(e.g. disease load in wastewater, biodiversity from environmental samples, and looking for natural products produced by the zillions of bacteria/archaea in the oceans). From e.g. a wastewater sample perspective the latter type is going to be the majority of data, we just filter out the stuff of interest and analyze it in situ - but theres no reason to store 1B E coli genomes whereas this is necessary if we want to understand cancer evolution.
If you want to use untargeted metagenomics to detect novel human viruses you're going to be generating petabytes all by yourself: https://arxiv.org/pdf/2108.02678.pdf
I can't see any reason why you would need to save petabytes. Remember- at that scale, people think really hard about whether to pay the long-term storage and associated costs (the value of having this system should exceed its costs). The case for this already exists in (for example) cancer and other pharma.
The storage is massively cheaper than the sequencing. At some point it could be worth going back and trying to figure out how much of the raw data you can safely discarded, but at least at first there are so many more other things that are more urgent.

(The paper I linked describes more or less what I'm currently working on)