by epistasis 3878 days ago

Typically we don't look at other genomes when finding the variants in an individual genome. Each genome is analyzed against the "reference" human genome, which is a composite assembled from about 10 individuals rather than any one person's sequence. This forms the coordinate system that is shared by everyone else.

Pretty much all genomic data uses a reference genome as the basis. The reference is versioned, and has a bug tracker, etc., for the various regions that have been difficult to assemble.

The flow is:

1. BCL (scans of the glass slide)
2. FASTQ (individual short reads and quality scores, unsorted and in random order)
3. BAM (individual short reads aligned to the reference genome)
4. VCF (the "diff" vs. the reference genome)
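
To make the BAM to VCF step concrete, here is a toy sketch in Python of what "diff vs. the reference" means. It is not how real variant callers work: it assumes the reads are already aligned, ignores quality scores and indels, and just takes the majority base at each position. The function name `call_variants` and the `(start, sequence)` read format are made up for illustration.

```python
from collections import Counter

def call_variants(chrom, reference, aligned_reads):
    """Toy variant caller (hypothetical helper, not a real tool).

    aligned_reads is an iterable of (start, sequence) pairs, where start is
    the 0-based position of the read on the reference. No indels, no
    quality scores: only a per-position majority vote.
    """
    # One counter of observed bases per reference position (a "pileup").
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        if not counts:
            continue  # no read covers this position
        base, _ = counts.most_common(1)[0]
        if base != reference[pos]:
            # VCF-style record: chromosome, 1-based position, REF, ALT.
            variants.append((chrom, pos + 1, reference[pos], base))
    return variants

reference = "ACGTACGT"
reads = [(0, "ACGA"), (2, "GAAC"), (3, "AACG")]
variants = call_variants("chrToy", reference, reads)
```

The output only lists positions where the sample disagrees with the reference, which is why a VCF is so much smaller than the BAM it was derived from.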

All of this can be done with under 10 GB of reference data and code, where the reference data is the current human genome, a Burrows-Wheeler transform of the human genome (the index the aligners search against), gene locations, and dbSNP (the database of common human variation).
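
For anyone unfamiliar with the Burrows-Wheeler transform, here is a minimal Python sketch of the transform and its inverse using the naive sorted-rotations method. This is only an illustration on a tiny string: real aligners build the index with suffix arrays and far more compact data structures, and the `$` end marker is an assumption of this sketch (it must not appear in the input and must sort before every other character).

```python
def bwt(text, end="$"):
    """Naive Burrows-Wheeler transform via sorted rotations."""
    assert end not in text, "end marker must not occur in the input"
    s = text + end
    # All cyclic rotations of s, sorted lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last_column, end="$"):
    """Naive inverse: rebuild the sorted rotation table column by column."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the last column to every row, then re-sort.
        table = sorted(c + row for c, row in zip(last_column, table))
    # The original string is the row that ends with the end marker.
    original = next(row for row in table if row.endswith(end))
    return original[:-1]

transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform is reversible and groups similar contexts together, which is what lets an index built on it support fast substring search over the genome.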