by dekhn
1031 days ago
And since the data is highly redundant, the 750 MB can be compressed even further using standard approaches (DEFLATE works well; it uses both Huffman coding and dictionary backreferences). Or you could build an embedding with far fewer parameters that explains the vast majority of phenotypic differences. The genome is a hierarchical palimpsest of low entropy. My standard interview question (because I hate LeetCode) walks the interviewee through compressing DNA using bit encoding, then using that encoding to implement a rolling hash for fast frequency counting. Some folks get stuck at "how many bits in a byte", others at "if you have 4 symbols, how many bits are required to encode a symbol?", and other candidates jump straight to Bloom filters and other probabilistic approaches
(https://github.com/bcgsc/ntHash and https://github.com/dib-lab/khmer are good places to start if you are interested).
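For anyone who wants to try the exercise, here is a minimal Python sketch of the two steps: packing bases into 2 bits each, then rolling a 2k-bit integer along the sequence to count k-mers. The function names and the A/C/G/T to 0/1/2/3 assignment are my own assumptions for illustration, and the rolling key is an exact encoding of each k-mer rather than the hash used by ntHash or khmer. It also assumes the input contains only A, C, G and T (no N or lowercase).

```python
from collections import Counter

# 2-bit code per base; this particular assignment is an arbitrary choice.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, 4 bases per byte (last byte zero-padded)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # First base of each group of four goes in the two highest bits.
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(out)


def count_kmers(seq: str, k: int) -> Counter:
    """Count k-mers, keyed by a rolling 2k-bit integer."""
    counts = Counter()
    mask = (1 << (2 * k)) - 1  # keeps only the most recent k bases
    h = 0
    for i, base in enumerate(seq):
        # Shift in the new base and drop the oldest one: O(1) per position.
        h = ((h << 2) | CODE[base]) & mask
        if i >= k - 1:
            counts[h] += 1
    return counts


def decode(h: int, k: int) -> str:
    """Turn a 2k-bit key back into its k-mer string."""
    return "".join("ACGT"[(h >> (2 * (k - 1 - j))) & 3] for j in range(k))


if __name__ == "__main__":
    seq = "ACGTACGT"
    print(pack(seq).hex())
    for key, n in count_kmers(seq, 4).most_common():
        print(decode(key, 4), n)
```

Because each update is a shift, an OR and a mask, the cost per position does not depend on k. In a language with fixed-width integers the same idea works for k up to 32 in a 64-bit word.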