HN Mirror


by burntsushi 2198 days ago

> I have also encountered situations using rg on med-large datasets where the files may contain Unicode, regular expressions, random garbage... and rg failed.

I would really love to have a bug report if you could find a way to reproduce this. ripgrep should certainly not fail regardless of the input. I've actually spent a lot of time trying to make sure that ripgrep's error handling and so forth generally match GNU grep's behavior.

> Here is an example showing the performance gap to be small using (enwiki-20200401-pages-articles-multistream.xml 72G) split into 24 files on a ramdisk. I ran this 50 times and the results are similar.

Could you show me where to get `enwiki-20200401-pages-articles-multistream.xml`? I'd love to run the benchmark myself, although I don't have a ramdisk quite that big. :P I could shrink the data set a bit, though.

If I were to guess though, yeah, I'd generally expect GNU grep and ripgrep to perform similarly here. Speculating:

If the match count of those runs is very high, then total search time is likely dominated by per-match overhead. Both ripgrep and GNU grep should be pretty close there, and there isn't much room for either to improve drastically.

If the match count of those runs isn't high or is very small, then GNU grep and ripgrep will use a search algorithm that is virtually identical in its performance characteristics. They will both probably look for occurrences of `>` using `memchr`, and then run a full search on matching lines. It's hard to get much faster than that, so again, not a lot of room for differentiating yourself.
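To make that second case concrete, here is a rough Python sketch of the general "scan for one byte, then confirm on the line" idea. It is not how ripgrep or GNU grep are actually implemented: `bytes.find` only stands in for `memchr`, the function name `search_with_prefilter` and the `<title>` pattern are made up for illustration (the pattern from the benchmark isn't quoted in this thread), and the whole thing assumes every possible match must contain the chosen byte.

```python
import re


def search_with_prefilter(haystack: bytes, needle: bytes, pattern):
    """Yield (line_start, line) for each line that contains `needle`
    and also matches `pattern`.

    Assumption: every match of `pattern` contains `needle`, so lines
    without it can be skipped without running the regex at all.
    """
    pos = 0
    n = len(haystack)
    while pos < n:
        # Fast scan for the literal byte (stand-in for memchr).
        hit = haystack.find(needle, pos)
        if hit == -1:
            break
        # Expand the hit to the boundaries of its line.
        line_start = haystack.rfind(b"\n", 0, hit) + 1
        line_end = haystack.find(b"\n", hit)
        if line_end == -1:
            line_end = n
        line = haystack[line_start:line_end]
        # Only candidate lines pay for the full regex search.
        if pattern.search(line):
            yield line_start, line
        # Continue after this line so each line is checked at most once.
        pos = line_end + 1


def search_naive(haystack: bytes, pattern):
    """Reference version: run the regex on every line."""
    return [line for line in haystack.split(b"\n") if pattern.search(line)]


# Hypothetical sample data and pattern, for illustration only.
data = b"plain text\n<title>Rust</title>\nno tag here\n<id>42</id>\na > b\n"
title_re = re.compile(rb"<title>[^<]+</title>")
matches = list(search_with_prefilter(data, b">", title_re))
```

Both functions report the same lines on this sample; the prefiltered one just never runs the regex on lines without a `>`. When nearly every line contains the byte, the prefilter buys very little, which is part of why both tools end up in about the same place.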

If I can get my hands on the corpus, then I'm pretty sure I could give you (non-pathological) patterns on which searches would show very different performance characteristics between GNU grep and ripgrep.

2 comments

lcdoutlet 2198 days ago

Thank you for your reply. Again, I am a big fan of your work, specifically rg and fst.

I apologize for mentioning this without submitting bug reports. I will do so in the near future.

The dataset can be found here. https://dumps.wikimedia.org/enwiki/20200420/enwiki-20200420-...

In general, when I start a search, the patterns are somewhat pathological. For example, when learning about a new codebase I might start with 10-100 regexes and 100+ keywords. With each iteration the complexity is reduced until I find the most relevant parts of the codebase.

I know rg performs significantly better than grep out of the box. I think grep by default is compiled without optimizations and does not use concurrency. I would be interested in comparing the performance characteristics between the tools in more detail.


Jasper_ 2198 days ago

I'm not sure 0401 is still around, but here's 0420, which is the oldest I could find. Should be a similar corpus.

https://dumps.wikimedia.your.org/enwiki/20200420/
