HN Mirror


by lcdoutlet 2198 days ago

I am a big fan of ripgrep. I think the tool is great for quick, simple searches. On larger data sets with sufficiently complicated regex patterns, though, the performance gap between rg and grep is small. I have also encountered situations using rg on med-large datasets where the files may contain Unicode, regular expressions, random garbage... and rg failed. I don't recall a situation like this where the error handling in grep didn't recover and move on to the next file.

Here is an example showing the performance gap to be small using (enwiki-20200401-pages-articles-multistream.xml 72G) split into 24 files on a ramdisk. I ran this 50 times and the results are similar.

    du -sch .
    72G     .
    72G     total

    time ls|xargs -i -P24 rg -c "<.?>" {}
    real    0m6.841s
    user    2m8.766s
    sys     0m5.637s

    time ls|xargs -i -P24 grep -Ec "<.?>" {}
    real    0m6.168s
    user    1m35.058s
    sys     0m24.177s


burntsushi 2198 days ago

> I have also encountered situations using rg on med-large datasets where the files may contain Unicode, regular expressions, random garbage... and rg failed.

I would really love to have a bug report if you could find a way to reproduce this. ripgrep should certainly not fail regardless of the input. I've actually spent a lot of time trying to make sure that ripgrep's error handling and so forth generally match GNU grep's behavior.

> Here is an example showing the performance gap to be small using (enwiki-20200401-pages-articles-multistream.xml 72G) split into 24 files on a ramdisk. I ran this 50 times and the results are similar.

Could you show me where to get `enwiki-20200401-pages-articles-multistream.xml`? I'd love to run the benchmark myself, although I don't have a ramdisk quite that big. :P I could shrink the data set a bit, though.

If I were to guess though, yeah, I'd generally expect GNU grep and ripgrep to perform similarly here. Speculating:

If the match count of those runs is very high, then total search time is likely dominated by per-match overhead. Both ripgrep and GNU grep should be pretty close there, but there's not too much room to improve drastically.

If the match count of those runs isn't high or is very small, then GNU grep and ripgrep will use a search algorithm that is virtually identical in its performance characteristics. They will both probably look for occurrences of `>` using `memchr`, and then run a full search on matching lines. It's hard to get much faster than that, so again, not a lot of room for differentiating yourself.
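To make the "find a required byte, then run the full search only on that line" idea concrete, here is a minimal sketch in Python. It is not code from ripgrep or GNU grep; the function name `count_matching_lines` is made up for illustration, and `bytes.find` merely stands in for `memchr`. It assumes that every match of the pattern must contain the given literal, which holds for `<.?>` and `>`.

```python
import re


def count_matching_lines(data: bytes, literal: bytes, pattern: "re.Pattern[bytes]") -> int:
    """Count lines matching `pattern`, visiting only lines that contain `literal`.

    Assumption: every match of `pattern` contains `literal`, so lines
    without the literal can be skipped without running the regex at all.
    """
    count = 0
    pos = 0
    while True:
        # Fast scan for the required literal (the role memchr plays).
        hit = data.find(literal, pos)
        if hit == -1:
            return count
        # Expand the hit to the boundaries of the enclosing line.
        line_start = data.rfind(b"\n", 0, hit) + 1
        line_end = data.find(b"\n", hit)
        if line_end == -1:
            line_end = len(data)
        # Only candidate lines pay for the full regex search.
        if pattern.search(data[line_start:line_end]):
            count += 1
        # Resume after this line so each line is counted at most once.
        pos = line_end + 1


sample = b"plain text\n<a> tag\nno match > here\n<b>\n"
print(count_matching_lines(sample, b">", re.compile(rb"<.?>")))
```

On the sample, the first line is never handed to the regex, the third line is a candidate that the regex rejects, and the other two lines are counted. When the literal is rare, almost all of the time goes into the byte scan, which is why two tools using this strategy would end up with similar timings.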

If I can get my hands on the corpus, then I'm pretty sure I could give you (non-pathological) patterns on which searches would show very different performance characteristics between GNU grep and ripgrep.


lcdoutlet 2198 days ago

Thank you for your reply. Again, I am a big fan of your work, specifically rg and fst.

I apologize for mentioning this without submitting bug reports. I will do so in the near future.

The dataset can be found here. https://dumps.wikimedia.org/enwiki/20200420/enwiki-20200420-...

In general, when I start a search, the patterns are somewhat pathological. For example, when learning about a new codebase I might start with 10-100 regexes and 100+ keywords. With each iteration the complexity is reduced until I find the most relevant parts of the codebase.

I know rg performs significantly better than grep out of the box. I think grep by default is compiled without optimizations and does not use concurrency. I would be interested in comparing the performance characteristics between the tools in more detail.


Jasper_ 2198 days ago

I'm not sure 0401 is still around, but here's 0420, which is the oldest I could find. Should be a similar corpus.

https://dumps.wikimedia.your.org/enwiki/20200420/
