by epistasis 3874 days ago
That sounds quite feasible, though it won't really be worth the effort until we have quite a few more genomes. Typically you would also store extra information about each variant (is it in a gene, does it change a protein, etc.) so that extra lookups aren't required during a scan.

There are typically 4-6 million variants discovered through this method of genome sequencing in a normal genome. A simple variant consists of a genome coordinate at ~32 bits (one of 3.2e9 positions) and the change from the reference, which is an (x, y) index into {A, C, G, T}^2, at ~4 bits. The variants are spaced on average ~1k bases apart, so the coordinate could probably be squeezed into ~15 bits with clever encoding. A naive encoding of this information comes to about 27MB (6 million variants at 36 bits each), and that could probably be pushed down to ~10MB if coordinates are stored as deltas from the previous variant rather than as absolute positions. 1MB seems feasible, but with diminishing returns computationally.
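To make the arithmetic concrete, here is a rough sketch in Python. The function names are made up for illustration, and it assumes fixed-width records and picks 6 million variants for the naive case and 4 million for the delta case (the two ends of the 4-6 million range), which is my reading of the estimate rather than anything more precise.

```python
# Back-of-the-envelope sizes for a per-genome variant list.
# Fixed-width records and the chosen variant counts are assumptions.

def encoded_megabytes(n_variants: int, coord_bits: int, change_bits: int = 4) -> float:
    """Size in MB (10**6 bytes) of n_variants fixed-width records."""
    return n_variants * (coord_bits + change_bits) / 8 / 1e6

# Absolute 32-bit coordinates, high end of the variant count.
naive_mb = encoded_megabytes(6_000_000, coord_bits=32)
# ~15-bit gaps between neighbouring variants, low end of the variant count.
delta_mb = encoded_megabytes(4_000_000, coord_bits=15)

def delta_encode(positions):
    """Replace sorted absolute positions with the gap from the previous one."""
    deltas, previous = [], 0
    for pos in positions:
        deltas.append(pos - previous)
        previous = pos
    return deltas

def delta_decode(deltas):
    """Rebuild absolute positions by accumulating the gaps."""
    positions, total = [], 0
    for gap in deltas:
        total += gap
        positions.append(total)
    return positions

print(naive_mb, delta_mb)                 # 27.0 9.5
print(delta_encode([1000, 2100, 2150]))   # [1000, 1100, 50]
```

Note that a fixed 15-bit field cannot hold a gap larger than 32767 bases, so a real format would presumably need an escape value or a variable-length code for the occasional long gap.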


Interesting - wouldn't you need to have access to "the rest" of the genomes that you are comparing against? In other words, wouldn't you need to keep that ~100 GB from the newly sequenced genome in temporary storage while comparing against the rest of the database stored somewhere in the cloud, before then condensing the new genome into a variant file?

Typically we don't look at other genomes while we find the variants in an individual genome. Each genome is analyzed against the "reference" human genome, which is an average of 10 individuals. This forms the coordinate basis that is shared for everyone else.

Pretty much all genomic data uses a reference genome as the basis. This is versioned, and has a bug tracker, etc., for various regions that have been difficult to assemble.

The flow is:

1. BCL (scans of the glass slide)
2. FASTQ (individual short reads and quality scores, unsorted and in random order)
3. BAM (individual short reads aligned to the reference genome)
4. VCF (the "diff" vs. the reference genome)
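As a toy illustration of the last step, the "diff" idea can be sketched like this. It is nothing like a real variant caller: it assumes a single sample sequence that is already aligned to the reference with no insertions or deletions, and the `call_snvs` name and `chrToy` label are invented for the example.

```python
def call_snvs(reference: str, sample: str, chrom: str = "chrToy"):
    """Yield (chrom, pos, ref, alt) for each single-base mismatch.

    Positions are 1-based, as in VCF. Assumes equal-length sequences
    that are already aligned (a simplification for illustration).
    """
    if len(reference) != len(sample):
        raise ValueError("toy example needs equal-length, pre-aligned sequences")
    for i, (ref_base, alt_base) in enumerate(zip(reference, sample)):
        if ref_base != alt_base:
            yield (chrom, i + 1, ref_base, alt_base)

variants = list(call_snvs("ACGTACGT", "ACGAACGC"))
for chrom, pos, ref, alt in variants:
    print(chrom, pos, ref, alt, sep="\t")
# chrToy  4  T  A
# chrToy  8  T  C
```

The point is only that the output records positions and changes relative to the shared reference, so no other individual's genome is needed to produce it.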

All of this can be done with <10GB of reference data and code, where the reference data consists of the current human genome, a Burrows-Wheeler transform of the human genome, gene locations, and dbSNP (the database of common human variation).
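For anyone unfamiliar with the Burrows-Wheeler transform mentioned above, here is a naive sketch on a short string. It sorts all rotations, which is fine for a toy but would not scale to a 3.2e9-base genome; real aligners build the index far more cleverly. The `$` sentinel and function names are just conventions chosen for the example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    text += sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Naive inverse: repeatedly prepend the transform and re-sort the rows."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]

print(bwt("banana"))                    # annb$aa
print(inverse_bwt(bwt("GATTACA")))      # GATTACA
```

The transform is reversible and tends to group identical characters together, which is part of why it works well as a compact, searchable index of the reference.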