by BKPetkov
3880 days ago
Interesting - wouldn't you need to have access to "the rest" of the genomes that you are comparing against? In other words, wouldn't you need to keep that ~100 GB from the newly sequenced genome in temporary storage while comparing against the rest of the database stored somewhere in the cloud, before then condensing the new genome into a variant file?
Pretty much all genomic data uses a reference genome as the basis. The reference is versioned and even has a bug tracker for the regions that have been difficult to assemble.
The flow is:
1. BCL (base calls from the sequencer's scans of the glass slide)
2. FASTQ (individual short reads and quality scores, in no particular order)
3. BAM (individual short reads aligned to the reference genome)
4. VCF (the "diff" vs. the reference genome)
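To make the "diff" in step 4 concrete, here is a toy sketch of turning one aligned read into VCF-style records. It assumes a gap-free alignment and a single read; the function name `call_snvs`, the contig name `chrToy`, and the sequences are made up for illustration. Real variant callers pile up many reads, weigh the quality scores, and handle insertions and deletions.

```python
from typing import List, Tuple

# Hypothetical toy, not a real variant caller: one read, no gaps, no quality scores.
def call_snvs(reference: str, read: str, offset: int,
              chrom: str = "chrToy") -> List[Tuple[str, int, str, str]]:
    """Compare a read aligned at 0-based `offset` with the reference.

    Returns (CHROM, POS, REF, ALT) tuples, with POS 1-based as in VCF.
    """
    variants = []
    for i, base in enumerate(read):
        ref_base = reference[offset + i]
        if base != ref_base:
            variants.append((chrom, offset + i + 1, ref_base, base))
    return variants


reference = "ACGTACGTAC"   # made-up stand-in for the reference genome
read = "GTACCTA"           # made-up read, aligned at 0-based offset 2
print(call_snvs(reference, read, offset=2))
```

The only inputs are the read and the reference, which is why the result stays small: positions that match the reference are simply not recorded.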
All of this can be done with <10 GB of reference data and code, where the reference data is the current human genome, a Burrows-Wheeler transform of the human genome, gene locations, and dbSNP (the database of common human variation). None of the steps needs access to other people's sequenced genomes.
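For anyone who hasn't met the Burrows-Wheeler transform: below is a naive sketch of the transform itself on a short string. It sorts every rotation, which is fine for a demo but not how a genome-sized index is built; aligners precompute the transform of the reference once and search reads against that index. The function name `bwt` and the `$` end marker are choices made for this example.

```python
# Naive illustration only: quadratic memory, unsuitable for a real genome.
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the Burrows-Wheeler transform of `text`.

    A sentinel that sorts before every other character marks the end,
    which keeps the transform reversible.
    """
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # Build all rotations, sort them, and keep the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))  # annb$aa
```

The transformed string has the same length as the input plus the end marker, so storing it alongside the reference roughly doubles the sequence data rather than multiplying it.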