by epistasis
3874 days ago
That sounds quite feasible, though it won't really be worth the effort until we have quite a few more genomes. You'd typically also want extra information about each variant (is it in a gene, does it change a protein, etc.) so that extra lookups aren't required during a scan. This method of genome sequencing typically discovers 4-6 million variants in a normal genome. A simple variant consists of a genome coordinate at ~32 bits (one of 3.2e9 positions) and the change from the reference, which is an (x, y) index into {A, C, G, T}^2, at ~4 bits. The variants are spaced ~1k bases apart on average, so the coordinate could probably be squeezed into ~15 bits with clever encoding. So a naive encoding of this information comes to about 27MB (6 million variants at 36 bits each), and that could probably be shoved down to around 10MB if coordinates are stored as deltas from the previous variant rather than as absolute positions. 1MB seems feasible, but with diminishing returns computationally.
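For anyone who wants to check the arithmetic, here is a rough Python sketch. The function names are made up for illustration, and it assumes fixed-width records with no per-variant annotation and 1MB = 1e6 bytes. The ~15-bit delta width is just the estimate from the comment, not something the code derives.

```python
def packed_size_mb(n_variants, coord_bits, change_bits=4):
    """Size in MB (1e6 bytes) of n_variants fixed-width records."""
    return n_variants * (coord_bits + change_bits) / 8 / 1e6


def delta_encode(positions):
    """Turn sorted absolute coordinates into gaps from the previous one."""
    deltas = []
    prev = 0
    for pos in positions:
        deltas.append(pos - prev)
        prev = pos
    return deltas


def delta_decode(deltas):
    """Rebuild absolute coordinates by running a cumulative sum."""
    positions = []
    total = 0
    for gap in deltas:
        total += gap
        positions.append(total)
    return positions


# Absolute 32-bit coordinates, 6 million variants.
naive_mb = packed_size_mb(6_000_000, coord_bits=32)

# Assumed ~15-bit deltas, at both ends of the 4-6 million range.
delta_mb_high = packed_size_mb(6_000_000, coord_bits=15)
delta_mb_low = packed_size_mb(4_000_000, coord_bits=15)

# Toy coordinates (not real data) to show the round trip.
toy_positions = [100, 1100, 1850]
toy_deltas = delta_encode(toy_positions)
```

Under those assumptions the naive figure comes out at 27MB, and the delta version lands between roughly 9.5MB and 14MB depending on the variant count, so "around 10MB" corresponds to the lower end of the range. Getting well below that would need variable-length codes or a general-purpose compressor on top of the deltas.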