| well if you're looking for certain motifs or k-mers (https://en.wikipedia.org/wiki/K-mer), it's mostly scanning through pre-processed FASTA files. My last K-mer counter I wrote was CPU/Memory bound (big enough size strings means big hash tables to store this stuff in, or using something like a bloom filter (Jellyfish's approach), not IO bound. Anyway, for anyone who wants to write something fun, fasta parsers / k-mer counters are a good weekend project. here was my work in the area (shameless plug, though i'm not in the field anymore) https://github.com/mutantturkey/dna-utils. just tested again against a 400mbish genome fasta file from ncbi and it's still clearly an cpu/memory issue with huge hash tables. See below (4^x is the potential combinations of K-mers). K=9 is done with a 4^9 array (sane in memory), 4^12 is a sparse array (stored as a a std::unordered_map, because it's too large to allocate that much space in memory). One is obviously way slower calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 9 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null real 0m4.235s user 0m4.116s sys 0m0.108s calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 12 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null real 0m25.524s user 0m25.292s sys 0m0.152s calvin@bison:~/src/dna-utils |