Hacker News
by A_Beer_Clinked 3873 days ago
I found this link: https://medium.com/precision-medicine/how-big-is-the-human-g...

In summary:

> 1. In a perfect world (just your 3 billion letters): ~700 megabytes
> 2. In the real world, right off the genome sequencer: ~200 gigabytes
> 3. As a variant file, with just the list of mutations: ~125 megabytes
2 comments

If there is only 0.1% variation then we should be able to get a diff down to ~1MB with some cleverness.
That sounds quite feasible, though it won't really be worth the effort until we have quite a few more genomes. Variant files also typically carry extra information about each variant (is it in a gene, does it change a protein, etc.) so that extra lookups aren't required during a scan.

There are typically 4-6 million variants discovered through this method of genome sequencing in a normal genome. A simple variant consists of a genome coordinate at ~32 bits (one of 3.2e9), and the change from the reference, which is an (x, y) index into {A, C, G, T}^2, at ~4 bits. So a naive encoding of this information gets to 27MB (6 million variants at 36 bits each). The coordinates are typically spaced ~1k bases apart on average, so each coordinate could probably be squeezed into ~15 bits with clever encoding, and the whole thing could probably be shoved down into 10MB if coordinates are deltas from the previous one, rather than absolute. 1MB seems feasible, but with diminishing returns computationally.
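To make the arithmetic above concrete, here is a rough Python sketch. The function names (`encoded_size_mb`, `to_deltas`, `from_deltas`) are made up for illustration, and it assumes fixed-width fields and 1 MB = 1e6 bytes; it is a back-of-the-envelope check, not how any real variant format is laid out.

```python
from itertools import accumulate


def encoded_size_mb(num_variants, coord_bits, change_bits=4):
    """Size in MB (1 MB = 1e6 bytes) of a fixed-width variant list."""
    total_bits = num_variants * (coord_bits + change_bits)
    return total_bits / 8 / 1e6


def to_deltas(positions):
    """Turn sorted absolute coordinates into gaps from the previous coordinate."""
    deltas = []
    prev = 0
    for pos in positions:
        deltas.append(pos - prev)
        prev = pos
    return deltas


def from_deltas(deltas):
    """Recover absolute coordinates by taking a running sum of the gaps."""
    return list(accumulate(deltas))


print(encoded_size_mb(6_000_000, coord_bits=32))  # 27.0  (absolute coordinates)
print(encoded_size_mb(6_000_000, coord_bits=15))  # 14.25 (delta coordinates)
print(to_deltas([1_000, 1_850, 3_100]))           # [1000, 850, 1250]
```

At the low end of the range (4 million variants) the 15-bit delta version comes out to 9.5MB, which is roughly where the ~10MB figure lands.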

Interesting - wouldn't you need to have access to "the rest" of the genomes that you are comparing against? In other words, wouldn't you need to keep that ~100 GB from the newly sequenced genome in temporary storage while comparing against the rest of the database stored somewhere in the cloud, before then condensing the new genome into a variant file?
Typically we don't look at other genomes while we find the variants in an individual genome. Each genome is analyzed against the "reference" human genome, which is an average of 10 individuals. This forms the coordinate basis that is shared for everyone else.

Pretty much all genomic data uses a reference genome as the basis. This is versioned, and has a bug tracker, etc., for various regions that have been difficult to assemble.

The flow is:

1. BCL (scans of the glass slide)
2. FASTQ (individual short reads and quality scores, unsorted and in random order)
3. BAM (individual short reads aligned to the reference genome)
4. VCF (the "diff" vs. the reference genome)
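As a toy illustration of what "diff vs. the reference" means in the last step, here is a small Python sketch. It assumes the sample is already a fully assembled sequence of the same length as the reference and only handles single-base substitutions. Real variant calling works from aligned short reads with quality scores and also has to deal with insertions and deletions, so treat `call_snvs` and `apply_snvs` as hypothetical helpers, not as what any pipeline actually does.

```python
def call_snvs(reference, sample):
    """List single-base differences as (1-based position, ref base, alt base)."""
    if len(reference) != len(sample):
        raise ValueError("toy example: sequences must have equal length")
    return [
        (i + 1, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


def apply_snvs(reference, variants):
    """Rebuild the sample sequence from the reference plus the variant list."""
    seq = list(reference)
    for pos, ref_base, alt_base in variants:
        if seq[pos - 1] != ref_base:
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos - 1] = alt_base
    return "".join(seq)


reference = "ACGTACGT"
sample = "ACGAACGT"
variants = call_snvs(reference, sample)
print(variants)                         # [(4, 'T', 'A')]
print(apply_snvs(reference, variants))  # ACGAACGT
```

The point is only that the variant list plus the shared reference is enough to get the sample back, which is why the variant file can be so much smaller than the raw data.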

All of this can be done with <10GB of reference data and code, where the reference data is the current human genome, a Burrows-Wheeler transform of the human genome, gene locations, and dbSNP (the database of common human variation).
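For anyone unfamiliar with the Burrows-Wheeler transform mentioned above, here is a naive Python sketch of the transform itself. It sorts every rotation of the string, which is fine for a few characters but hopeless for 3 billion bases; real aligners build the index with suffix-array style methods. The `bwt` function name and the `$` end marker are assumptions for the example.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel  # the sentinel marks the end and sorts before A, C, G, T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("GATTACA"))  # ACTGA$TA
```

The output is a permutation of the input that can be searched efficiently once it is indexed, which is what makes it useful as precomputed reference data for aligning short reads.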

How long does it take (and with what computational bandwidth) to produce a 125MB variant file from 200GB raw sequence data?
Depending on the pipeline you use and the compute resources available you could have a full workflow done in anywhere from several hours to a couple days. Illumina BaseSpace is free (for now) and has some example data sets with a bunch of canned pipelines for analysis if you're interested in trying it for yourself. https://basespace.illumina.com/
You're not going to get to VCF on a whole genome in several hours.
With particular hardware and software you can. Edico Dragen claims speeds for BCL -> VCF of 20 minutes [1]. With Microsoft Research's SNAP aligner and 450GB of memory you can get whole genome alignment in ~30 minutes, and then variant calling can be done in a couple of hours.

1. http://www.edicogenome.com/dragen/dragen-gp/

Could you please elaborate on this?
I've seen 200GB runs take 4 days, and I've seen runs take 3 hours. It depends on your computing infrastructure, but more important is your IO. High CPU core counts and high-speed storage access make a big difference, as does distributing the computational workload.