Hacker News
by BKPetkov 3880 days ago
Interesting - wouldn't you need to have access to "the rest" of the genomes that you are comparing against? In other words, wouldn't you need to keep that ~100 GB from the newly sequenced genome in temporary storage while comparing against the rest of the database stored somewhere in the cloud, before then condensing the new genome into a variant file?
1 comment

Typically we don't look at other genomes while we find the variants in an individual genome. Each genome is analyzed against the "reference" human genome, which is a composite assembled from about 10 individuals rather than any one person's sequence. The reference provides the coordinate system that everyone's data shares.

Pretty much all genomic data uses a reference genome as the basis. The reference is versioned and has a bug tracker, etc., for the various regions that have been difficult to assemble.

The flow is:

1. BCL (scans of the glass slide)
2. FASTQ (individual short reads and quality scores, unsorted and in random order)
3. BAM (individual short reads aligned to the reference genome)
4. VCF (the "diff" vs. the reference genome)
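
To make the last step concrete, here is a toy sketch of how aligned reads turn into a "diff" against the reference. It is not how a real variant caller works: it handles only single-base substitutions, ignores quality scores and indels, and the function name `call_variants`, the `min_fraction` threshold, and the sample data are all made up for illustration.

```python
from collections import Counter


def call_variants(reference, aligned_reads, min_fraction=0.5):
    """Toy substitution caller (hypothetical, for illustration only).

    reference     -- reference sequence as a string
    aligned_reads -- list of (start, sequence) pairs, 0-based start, no indels
    min_fraction  -- share of reads that must support the non-reference base
    """
    # Build a pileup: for each reference position, count the bases observed.
    pileup = {}
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup.setdefault(start + offset, Counter())[base] += 1

    variants = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        base, support = counts.most_common(1)[0]
        depth = sum(counts.values())
        # Report a variant only when the dominant base differs from the reference.
        if base != reference[pos] and support / depth >= min_fraction:
            # Positions are reported 1-based, as in VCF.
            variants.append((pos + 1, reference[pos], base))
    return variants


reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCGTAC"), (4, "TCGTAC")]
print(call_variants(reference, reads))  # [(5, 'A', 'T')]
```

Note that the only inputs are the reference and the new sample's own reads. No other genomes are consulted, which is why the raw data can be condensed to a small variant file without access to the rest of the database.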

All of this can be done with <10 GB of reference data and code, where the reference data is the current human genome, a Burrows-Wheeler transform of the human genome, gene locations, and dbSNP (the database of common human variation).
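
For anyone unfamiliar with the Burrows-Wheeler transform mentioned above, here is a minimal sketch on a tiny string. It uses the naive sort-all-rotations construction, which would never scale to a 3-billion-base genome (real aligners build the index with far more memory-efficient methods); the function names and the `$` terminator are just conventions chosen for this example.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, take the last column."""
    if terminator in text:
        raise ValueError("terminator must not occur in the input text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Naive inverse: repeatedly prepend the transformed column and re-sort."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    # The row ending in the terminator is the original text.
    for row in table:
        if row.endswith(terminator):
            return row[:-1]
    raise ValueError("terminator not found in transformed text")


print(bwt("banana"))                 # annb$aa
print(inverse_bwt(bwt("GATTACA")))   # GATTACA
```

The transform is reversible and tends to group identical characters together, which is what makes it useful as a compact, searchable index of the reference for aligning short reads.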