Hacker News new | ask | show | jobs
by bkase 2546 days ago
One of the authors here:

Yes you are right, you should not replace grep for one off regexes due to the latency of the io operations. As far as I recall, memory throughput however is much higher on house than on CPUs (or at least this was true in 2012). Our intended use cases for this sort of a grep were cases where you are finding many needles in enormous haystacks such as: Looking for malicious machine code in executables that you download off of the internet (virus scanning), or searching for certain genetic sequences in DNA code. I believe we discussed this in our final presentation, but perhaps we left it out of the written report.

Yes we probably should have included the numbers to show why you wouldn't want to use this instead of grep for simple tasks. I am sorry we missed it.

5 comments

But getting any of that data off disk will typically be the bottleneck.

(Also, you'd want BLAST for DNA instead of grep, but perhaps this was an intentional simplification)

well if you're looking for certain motifs or k-mers (https://en.wikipedia.org/wiki/K-mer), it's mostly scanning through pre-processed FASTA files.

My last K-mer counter I wrote was CPU/Memory bound (big enough size strings means big hash tables to store this stuff in, or using something like a bloom filter (Jellyfish's approach), not IO bound.

Anyway, for anyone who wants to write something fun, fasta parsers / k-mer counters are a good weekend project.

here was my work in the area (shameless plug, though i'm not in the field anymore)

https://github.com/mutantturkey/dna-utils.

just tested again against a 400mbish genome fasta file from ncbi and it's still clearly an cpu/memory issue with huge hash tables. See below (4^x is the potential combinations of K-mers). K=9 is done with a 4^9 array (sane in memory), 4^12 is a sparse array (stored as a a std::unordered_map, because it's too large to allocate that much space in memory). One is obviously way slower

calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 9 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null

real 0m4.235s

user 0m4.116s

sys 0m0.108s

calvin@bison:~/src/dna-utils$ time ./kmer_total_count -k 12 -i 3816_ref_Abrus_2018_chrUn.fa.2 >/dev/null

real 0m25.524s

user 0m25.292s

sys 0m0.152s

calvin@bison:~/src/dna-utils

I'm quite certain Windows Defender uses (or used, at some point) GPUs in the machine to accelerate malware pattern matching. I remember a coworker talking about a funny bug as a consequence when the graphics driver executable itself was infected.
There's an article on the announcement last year: https://arstechnica.com/gadgets/2018/04/intel-microsoft-to-u...
That is for their enterprise "Microsoft Defender Advanced Threat Protection" endpoint security product which scans live memory for any hidden malware not writing to disk. Which would otherwise be consuming far more CPU than the typical default windows security software. Which is why they needed GPU in the first place. But it's an interesting solution none-the-less.

https://www.microsoft.com/en-us/microsoft-365/windows/micros...

I’ve run into this for simple tasks on a GPU, like merging sorted lists. The massive parallelism can’t make up for the transfer costs unless the operations performed are expensive enough. The “roofline” model is usually the perspective used for this kind of accelerator application.
Reminded me of a 12 year old paper on file carving by Marziale, Richard and Roussev[0].

[0] https://www.dfrws.org/sites/default/files/session-files/pres...

I know in pretty nearly all the domains I've worked, discrete packets of data will use regexp directly.

If we are using something more 'grep'-like, it's for large quantities of data, and in many cases we could treat these as a stream. So as long as the bandwidth to the GPU is sufficient, a little stall at the beginning would represent an undue hardship, if the results come back faster.