| HN Mirror



	by novas0x2a 5601 days ago
	One thing to check: what is your current locale? GNU grep is much, much slower when it's multibyte-aware, even if you're searching for an ASCII string. Try repeating your test with LC_ALL=C grep (don't forget to take the file cache into account). You can check the current values of your locale with `locale`.
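For anyone who wants to script the comparison, here is a minimal sketch of running `grep -c` with LC_ALL overridden for the child process only. The helper name `grep_count` and its parameters are made up for illustration; it assumes a `grep` binary is on the PATH.

```python
import os
import subprocess


def grep_count(pattern, path, lc_all=None):
    """Run `grep -c` on path and return the number of matching lines.

    If lc_all is given (e.g. "C"), it overrides LC_ALL for the child
    process only; the calling process keeps its own locale.
    """
    env = dict(os.environ)
    if lc_all is not None:
        env["LC_ALL"] = lc_all
    result = subprocess.run(
        ["grep", "-c", "-e", pattern, path],
        env=env,
        capture_output=True,
        text=True,
    )
    # grep exits with 1 when nothing matched; anything above 1 is a real error.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return int(result.stdout.strip())
```

Timing `grep_count(pattern, path)` against `grep_count(pattern, path, lc_all="C")` on the same file, after a warm-up run so both hit the file cache, should show whether the locale is the culprit.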

2 comments

gorset 5600 days ago

Nice! That fixed it. Now it's 3 seconds.

I'm still surprised that it's so much slower with UTF-8, though. I guess GNU grep is naively converting back and forth between representations? There's nothing in UTF-8 that should prevent it from doing this efficiently. Even with complex patterns, it is possible to search through the file in about the same time as with a simple pattern. E.g. aspell can build an FSA where each transition is O(1), making the search time more or less independent of the search pattern.


pkteison 5600 days ago

I assume the problem is that you can't skip ahead X characters on multibyte data without parsing the bytes, because you don't know how big a character is. So you may have to read every byte.


gorset 5600 days ago

But you can skip ahead without parsing all the bytes since UTF-8 is self-synchronizing. The only scenario I can envision is that GNU grep wants to perform Unicode normalization, to catch equivalent codepoints, and that this is implemented inefficiently.

Edit: Granted, since I'm using -c, it has to look at all the bytes to find all the newlines.
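
To make the self-synchronizing point concrete, here is a small sketch (not what GNU grep does internally, just the property itself): after jumping to an arbitrary byte offset, you can find the start of the character you landed in by backing up over continuation bytes, which always look like 10xxxxxx. The function name `char_start` is hypothetical, and the input is assumed to be valid UTF-8.

```python
def char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte of the UTF-8 character covering data[pos].

    Continuation bytes have the bit pattern 10xxxxxx, so we step back
    until we reach a byte that is not a continuation byte. In valid
    UTF-8 this takes at most three steps.
    """
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


# "a" is 1 byte, "é" is 2 bytes, "日" is 3 bytes.
sample = "aé日".encode("utf-8")
```

So landing in the middle of a multibyte character costs a couple of byte checks, not a parse from the beginning of the buffer.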


ssp 5601 days ago

That's also the case for sort. If you often sort huge ASCII files, running it with LC_ALL=C is a big win.
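
A rough illustration of why, assuming the usual explanation: in the C locale the sort order is plain byte order, so no locale collation rules are consulted. Sorting raw bytes in Python gives the same ordering. The function name `sort_like_c_locale` is made up for this sketch.

```python
def sort_like_c_locale(lines):
    """Sort byte strings by raw byte value, the order the C locale uses."""
    # bytes objects compare byte by byte, like memcmp, with no collation tables.
    return sorted(lines)


words = [b"banana", b"Apple", b"apple", b"Zebra"]
ordered = sort_like_c_locale(words)
```

Note that byte order puts all uppercase ASCII letters before lowercase ones, so the output order can differ from what a language-specific locale would give.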
