by tomjakubowski
1641 days ago
Human DNA contains roughly 3.2 billion nucleotides, so a 3 GB string suggests an encoding with one byte per nucleotide. I'm curious: since DNA has only 4 bases, this seems rather inefficient for genomic data. Is there any advantage to encoding the DNA with two bits per nucleotide? Source for 3.2 billion: https://www.ncbi.nlm.nih.gov/books/NBK21134/#!po=0.485437
|
In practice, BWT-based alignment tools may use a forward index and a mirror index of the reversed genome string (reversed, not reverse-complemented). This dual-index approach is important for dealing with strings that contain mismatches. There's a nice example explaining this for an older tool named Bowtie [2].
With a two-bit encoding and both indices, it isn't uncommon for a genome index to take up several GB of RAM. For example, BWA uses 2-3 GB for its index [3].
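To make the forward-index / mirror-index idea a bit more concrete, here is a toy sketch in Python. It is not how Bowtie or BWA actually implement their indices (real tools use compressed BWTs and sampled occurrence tables); the function names `build_index` and `count_matches` are made up for illustration, and the suffix array is built naively, so this only works for tiny strings.

```python
def build_index(text):
    """Build a toy FM-index (suffix array, C table, occurrence table)."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    # Naive suffix array: fine for a demo, far too slow for a real genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around to "$").
    bwt = "".join(text[i - 1] for i in sa)

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # c_table[ch]: number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, c_table, occ


def count_matches(index, pattern):
    """Count exact occurrences of pattern using backward search."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):  # backward search consumes the pattern right to left
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


genome = "ACGTACGTTAGC"
forward_index = build_index(genome)
# Mirror index: the genome reversed, NOT reverse-complemented.
mirror_index = build_index(genome[::-1])
```

Searching the mirror index with a reversed pattern, e.g. `count_matches(mirror_index, "ACG"[::-1])`, finds the same occurrences as the forward search but consumes the pattern from the opposite end, which is what gives an aligner a choice of which end of the read to match exactly first.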
[1] https://en.wikipedia.org/wiki/FM-index
[2] https://academic.oup.com/bioinformatics/article/25/14/1754/2...
[3] https://academic.oup.com/bioinformatics/article/25/14/1754/2...
There are some great computational benefits to using a 2-bit encoding for the BWT.
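For a rough sense of what a 2-bit encoding looks like, here is a minimal sketch in Python. The A/C/G/T to 0/1/2/3 mapping and the `pack`/`unpack` names are my own choices for illustration, not taken from any particular tool, and it assumes the sequence contains only the four bases (real reference genomes also contain `N` for unknown bases, which needs separate handling).

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, 2 bits per base
BASES = "ACGT"


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(packed, n):
    """Recover the first n bases from a packed byte string."""
    return "".join(BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))


packed = pack("GATTACA")

# Back-of-the-envelope size for ~3.2 billion nucleotides at 2 bits each.
two_bit_bytes = 3_200_000_000 * 2 // 8  # 800,000,000 bytes, vs ~3.2 GB at one byte per base
```

Seven bases fit in two bytes here, and `unpack(packed, 7)` returns the original string. The length has to be stored separately, since the last byte may be only partly used.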