Hacker News
by novas0x2a 5552 days ago
One thing to check: what is your current locale? GNU grep is much, much slower when it's multibyte-aware, even if you're searching for an ASCII string. Try repeating your test with LC_ALL=C grep (don't forget to take the file cache into account).

You can check the current values of your locale with `locale`.
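A minimal sketch of that comparison, assuming a POSIX-like shell with `mktemp`, `seq`, and `sed` available. The test file, its size, and the pattern `fox` are made up for illustration and are not from the original test:

```shell
# Hypothetical test data: 200000 ASCII lines in a temporary file.
testfile=$(mktemp)
seq 1 200000 | sed 's/$/ the quick brown fox/' > "$testfile"

# Show the locale settings currently in effect.
locale

# Read the file once so both runs see a warm file cache.
cat "$testfile" > /dev/null

# Same search twice: once in the current locale, once forced to the C locale.
count_locale=$(grep -c 'fox' "$testfile")
count_c=$(LC_ALL=C grep -c 'fox' "$testfile")

echo "current locale: $count_locale matches"
echo "LC_ALL=C:       $count_c matches"
```

For ASCII data both runs should report the same count. To compare speed, prefix each grep invocation with `time` in an interactive shell. How large the difference is depends on the grep version and the locale.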

2 comments

Nice! That fixed it. Now it's 3 seconds.

I'm still surprised that it's so much slower with UTF-8, though. I guess GNU grep is naively converting back and forth between representations? There's nothing in UTF-8 that should prevent it from doing this efficiently. Even with complex patterns, it is possible to search through the file in about the same time as with a simple pattern. E.g. aspell can build an FSA where each transition is O(1), making the search time more or less independent of the search pattern.

I assume the problem is that you can't skip ahead X characters on multibyte data without parsing the bytes, because you don't know how big a character is. So you may have to read every byte.
But you can skip ahead without parsing all the bytes, since UTF-8 is self-synchronizing. The only scenario I can envision is that GNU grep wants to perform Unicode normalization, to catch equivalent codepoints, and that this is implemented inefficiently.

Edit: Granted, since I'm using -c, it has to look at all the bytes to find all the newlines.
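To illustrate the self-synchronizing point: in UTF-8 every continuation byte has the form 10xxxxxx (0x80 to 0xBF, octal 200 to 277), so whether a byte starts a character can be decided from that byte alone. A rough sketch, assuming `tr` accepts octal byte ranges in the C locale. The sample string is made up:

```shell
# "café": é is U+00E9, encoded as the two bytes 0xC3 0xA9 (octal 303 251).
sample=$(printf 'caf\303\251')

# Total number of bytes in the string.
bytes=$(printf '%s' "$sample" | wc -c | tr -d ' ')

# Drop every continuation byte (octal 200-277); each byte that remains
# starts exactly one character, so the byte count is the character count.
chars=$(printf '%s' "$sample" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' ')

echo "bytes: $bytes, characters: $chars"
```

Nothing here decodes the text: a scanner that lands in the middle of a character only has to step past continuation bytes to reach the next boundary.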

That's also the case for sort. If you often sort huge ASCII files, doing it in LC_ALL=C is a huge win.
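A small sketch of the sort case, using a made-up four-line input. One caveat: in the C locale, sort compares raw byte values, so the output order can differ from what a language-aware locale produces (uppercase ASCII letters come before lowercase ones).

```shell
# Hypothetical input file with mixed-case ASCII lines.
words=$(mktemp)
printf 'b\nA\na\nB\n' > "$words"

# Byte-wise ordering: 'A' (0x41) and 'B' (0x42) sort before 'a' (0x61) and 'b' (0x62).
sorted_c=$(LC_ALL=C sort "$words" | tr '\n' ' ')
echo "$sorted_c"

rm -f "$words"
```

If the byte-wise order is acceptable for the data, the same `LC_ALL=C` prefix applies to large files unchanged.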