| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by calvinmorrison 2555 days ago

well if you're looking for certain motifs or k-mers (https://en.wikipedia.org/wiki/K-mer), it's mostly scanning through pre-processed FASTA files.

My last K-mer counter I wrote was CPU/Memory bound (big enough size strings means big hash tables to store this stuff in, or using something like a bloom filter (Jellyfish's approach), not IO bound.

Anyway, for anyone who wants to write something fun, fasta parsers / k-mer counters are a good weekend project.

here was my work in the area (shameless plug, though i'm not in the field anymore)

https://github.com/mutantturkey/dna-utils.

just tested again against a 400mbish genome fasta file from ncbi and it's still clearly an cpu/memory issue with huge hash tables. See below (4^x is the potential combinations of K-mers). K=9 is done with a 4^9 array (sane in memory), 4^12 is a sparse array (stored as a a std::unordered_map, because it's too large to allocate that much space in memory). One is obviously way slower

calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 9 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null

real 0m4.235s

user 0m4.116s

sys 0m0.108s

calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 12 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null

real 0m25.524s

user 0m25.292s

sys 0m0.152s

calvin@bison:~/src/dna-utils