| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by lyle_nel 3459 days ago

I have a small cluster of machines that I run experiments on. GNU parallel makes the dispatch of jobs on remote machines very easy.

In addition, I often use it to search for sequences by running grep in parallel. For example

$ parallel 'grep {1} -f haystack.txt' :::: many_needles.txt

Where {1} is a single line in many_needles.txt

3 comments

bane 3458 days ago

If you find yourself searching lots of haystacks, and your needles are just text and not a regex, a better approach is to stuff all the needles into some kind of index, then chop up the haystack into overlapping tiles (of variable width from the smallest needle to the largest), then search each tile against the index of needles. This effectively searches all the needles at once and turns the operation from O(n) where n is the number of needles to O(m) where m is the number of tiles in haystack.txt.

It may seem to be a trivial difference, but then you can search multiple haystacks at once fairly easily, and this approach scales to hundreds of millions of needles at once. The code for it isn't very difficult either, heck you can just use an in memory SQLite dB to get a searchable, temporary, index and rely on using some of the most tested software in history.

link

agentgt 3458 days ago

This also works for sorting as well and is typically called Radix Sort or Bucket Sort.

Basically using unique attributes of the data you then divide and conquer on those attributes (e.g. for Radix you make buckets based on the digits).

link

justinsaccount 3459 days ago

Unless those patterns are regexes, you should just be using

    $ fgrep -f many_needles.txt haystack.txt

link

nothrabannosir 3459 days ago

How much faster is a plain text search really than an regexp without special characters? You'd think this would be quite easy to optimise for a regexp engine.

I admit I try to use -f all the time but your post suddenly made me realise I'd never actually measured the effect. :/

Edit: yes sorry I meant -F

link

burntsushi 3459 days ago

There are a few things to clear up here. Namely, fgrep (equivalent to grep -F) is orthogonal to -f. fgrep/`grep -F` is for "fixed string" search, where as -f is for reading multiple patterns from a file. So `fgrep -f file` means "do a simple literal (non-regexp) search for each string in `file`."

There are two possible reasons why one might want to explicitly use fgrep/`grep -F`: 1) to avoid needing to escape a string that otherwise contains special regex characters and 2) because it permits the search tool to avoid the regexp engine entirely.

(1) is always a valid reason and is actually quite useful because escaping regexes can be a bother. But whether (2) is valid or not depends on whether your search tool is smart enough to recognize a simple literal search and automatically avoid the regex engine. Another layer to this is, of course, whether the regex engine itself is smart enough to handle this case for you automatically. Whether these optimizations are actually applied or not is difficult for a casual user to know. I don't actually know of any tool that doesn't optimize the simplest case (when no special regex features are required and it's just a simple literal search), so it seems to me that one should never use fgrep/`grep -F` for performance reasons alone.

However, if you use the `-f` flag, then you've asked the tool to do multiple string search. Perhaps in this case, the search tool doesn't try as hard to do simple literal optimizations. Indeed, I can actually witness evidence in favor of this guess. The first command takes 15s and the second command takes 10s:

    LC_ALL=C grep -c -f queries /tmp/OpenSubtitles2016.raw.en
    LC_ALL=C grep -c -F -f queries /tmp/OpenSubtitles2016.raw.en

The contents of `queries`:

    $ cat queries 
    Sherlock Holmes
    John Watson
    Professor Moriarty
    Irene Adler

grep in this case is GNU grep 2.26. The size of /tmp/OpenSubtitles2016.raw.en is 9.3GB. The only difference between the commands is the presence of the -F switch in the second command. My /tmp is a ramdisk, so the file was already in memory and therefore isn't benchmarking the speed of my disk. The corpus can be downloaded here (warning, multiple GB): http://opus.lingfil.uu.se/OpenSubtitles2016/mono/OpenSubtitl...

Interestingly, performing a similar test using BSD grep shows no differences in the execution time, which suggests BSD grep isn't doing anything smart even when it knows it has only literals (and I say this because BSD grep is outrageously slow).

As a small plug, ripgrep is four times faster than GNU grep on this test and has no difference whether you pass -F or not.

(This is only scratching the surface of literal optimizations that a search tool can do. For example, a good search tool will search for `foo` when matching the regex `\w+foo\d+` before ever entering the regex engine itself.)

link

recuter 3458 days ago

This is a great comment and ripgrep deserves a more prominent plug, clickable: https://github.com/BurntSushi/ripgrep

link

tedunangst 3458 days ago

BSD grep decides on a pattern by pattern basis which match engine to use. -F is unlikely to affect performance.

link

burntsushi 3458 days ago

Oh dear, you appear to be correct. Adding additional queries to `queries` (while being careful not to increase total match count by much) appears to increase search time linearly. From that, it looks like BSD grep is just running an additional pass over each line for each query.

link

lgessler 3458 days ago

(Sorry, this is mildly off-topic.) Not sure if this fits your usecase, but you should check out codesearch if you haven't already: https://github.com/google/codesearch

(Russ Cox's excellent writeup is here: https://swtch.com/~rsc/regexp/regexp4.html)

link