|
|
|
|
|
by cvoss
1036 days ago
|
|
It's a slight exaggeration of the information content to report the data size using an ASCII encoding. Since there are 4 bases, each can be encoded using 2 bits, rather than 8. So we're really talking 750 megabytes. But still mind-blowing. |
|
Or, you could build an embedding with far fewer parameters that could explain the vast majority of phenotypic differences. the genome is a hierarchical palimpsest of low entropy.
My standard interview question- because I hate leetcode- walks the interviewee through compressing DNA using bit encoding, then using that to implement a rolling hash to do fast frequency counting. Some folks get stuck at "how many bits in a byte", others at "if you have 4 symbols, how many bits are required to encode a symbol?", and other candidates jump straight to bloom filters and other probabilistic approaches (https://github.com/bcgsc/ntHash and https://github.com/dib-lab/khmer are good places to start if you are interested).