well if you're looking for certain motifs or k-mers (https://en.wikipedia.org/wiki/K-mer), it's mostly scanning through pre-processed FASTA files.
My last K-mer counter I wrote was CPU/Memory bound (big enough size strings means big hash tables to store this stuff in, or using something like a bloom filter (Jellyfish's approach), not IO bound.
Anyway, for anyone who wants to write something fun, fasta parsers / k-mer counters are a good weekend project.
here was my work in the area (shameless plug, though i'm not in the field anymore)
just tested again against a 400mbish genome fasta file from ncbi and it's still clearly an cpu/memory issue with huge hash tables. See below (4^x is the potential combinations of K-mers). K=9 is done with a 4^9 array (sane in memory), 4^12 is a sparse array (stored as a a std::unordered_map, because it's too large to allocate that much space in memory). One is obviously way slower
calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 9 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null
real 0m4.235s
user 0m4.116s
sys 0m0.108s
calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 12 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null
My last K-mer counter I wrote was CPU/Memory bound (big enough size strings means big hash tables to store this stuff in, or using something like a bloom filter (Jellyfish's approach), not IO bound.
Anyway, for anyone who wants to write something fun, fasta parsers / k-mer counters are a good weekend project.
here was my work in the area (shameless plug, though i'm not in the field anymore)
https://github.com/mutantturkey/dna-utils.
just tested again against a 400mbish genome fasta file from ncbi and it's still clearly an cpu/memory issue with huge hash tables. See below (4^x is the potential combinations of K-mers). K=9 is done with a 4^9 array (sane in memory), 4^12 is a sparse array (stored as a a std::unordered_map, because it's too large to allocate that much space in memory). One is obviously way slower
calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 9 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null
real 0m4.235s
user 0m4.116s
sys 0m0.108s
calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 12 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null
real 0m25.524s
user 0m25.292s
sys 0m0.152s
calvin@bison:~/src/dna-utils