Hacker News new | ask | show | jobs
by jldugger 2546 days ago
But getting any of that data off disk will typically be the bottleneck.

(Also, you'd want BLAST for DNA instead of grep, but perhaps this was an intentional simplification)

1 comments

well if you're looking for certain motifs or k-mers (https://en.wikipedia.org/wiki/K-mer), it's mostly scanning through pre-processed FASTA files.

My last K-mer counter I wrote was CPU/Memory bound (big enough size strings means big hash tables to store this stuff in, or using something like a bloom filter (Jellyfish's approach), not IO bound.

Anyway, for anyone who wants to write something fun, fasta parsers / k-mer counters are a good weekend project.

here was my work in the area (shameless plug, though i'm not in the field anymore)

https://github.com/mutantturkey/dna-utils.

just tested again against a 400mbish genome fasta file from ncbi and it's still clearly an cpu/memory issue with huge hash tables. See below (4^x is the potential combinations of K-mers). K=9 is done with a 4^9 array (sane in memory), 4^12 is a sparse array (stored as a a std::unordered_map, because it's too large to allocate that much space in memory). One is obviously way slower

calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 9 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null

real 0m4.235s

user 0m4.116s

sys 0m0.108s

calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 12 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null

real 0m25.524s

user 0m25.292s

sys 0m0.152s

calvin@bison:~/src/dna-utils