Hacker News
by gorset 5558 days ago
Nice! That fixed it. Now it's 3 seconds.

I'm still surprised that it's so much slower with UTF-8, though. I guess GNU grep is naively converting back and forth between representations? There's nothing in UTF-8 that should prevent it from doing this efficiently. Even with complex patterns, it's possible to search through the file in about the same time as with a simple pattern. E.g. aspell can build an FSA where each transition is O(1), making the search time more or less independent of the search pattern.
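
To illustrate the O(1)-per-transition point, here is a minimal sketch. It is not what GNU grep or aspell actually does. It builds a KMP-style table-driven automaton over the raw UTF-8 bytes of a literal pattern and counts matching lines, roughly like `grep -c`. The names `build_dfa` and `count_matching_lines` are made up for this example, and it assumes a non-empty literal pattern with no newline in it.

```python
def build_dfa(pattern: bytes) -> list[list[int]]:
    """Build a KMP-style DFA: dfa[state][byte] gives the next state.

    State len(pattern) is the accepting state. Assumes a non-empty pattern.
    """
    m = len(pattern)
    dfa = [[0] * 256 for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    restart = 0  # state the automaton falls back to on a mismatch
    for j in range(1, m):
        dfa[j] = list(dfa[restart])      # mismatch: behave like the restart state
        dfa[j][pattern[j]] = j + 1       # match: advance
        restart = dfa[restart][pattern[j]]
    dfa[m] = list(dfa[restart])          # keep scanning after a full match
    return dfa


def count_matching_lines(data: bytes, pattern: bytes) -> int:
    """Count lines containing the pattern, one table lookup per byte."""
    dfa = build_dfa(pattern)
    accept = len(pattern)
    state = 0
    matched = False
    count = 0
    for b in data:
        if b == 0x0A:                    # newline ends the current line
            count += matched
            matched = False
            state = 0
        else:
            state = dfa[state][b]
            if state == accept:
                matched = True
    return count + matched               # last line may lack a trailing newline


text = "naïve café\nplain line\ncafé au lait\n".encode("utf-8")
print(count_matching_lines(text, "café".encode("utf-8")))
```

Nothing here decodes characters: for valid UTF-8 input and a valid UTF-8 pattern, a byte-level match always lines up with character boundaries, so the work per input byte is a single table lookup regardless of the pattern.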

1 comments

I assume the problem is that you can't skip ahead X characters on multibyte data without parsing the bytes, because you don't know how big a character is. So you may have to read every byte.
But you can skip ahead without parsing all the bytes, since UTF-8 is self-synchronizing. The only scenario I can envision is that GNU grep wants to perform Unicode normalization, to catch equivalent codepoints, and that this is implemented inefficiently.

Edit: Granted, since I'm using -c, it has to look at all the bytes to find all the newlines.
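
For what it's worth, a small sketch of what self-synchronizing means in practice, under the assumption that the input is valid UTF-8. The helper names `sync_forward` and `count_newlines` are invented for illustration. Continuation bytes always look like `10xxxxxx`, so after jumping to an arbitrary byte offset you can find the next character boundary by skipping at most three bytes. The byte 0x0A never occurs inside a multibyte sequence, so newlines can be counted on raw bytes.

```python
def is_continuation(b: int) -> bool:
    """True for UTF-8 continuation bytes, which have the form 10xxxxxx."""
    return (b & 0xC0) == 0x80


def sync_forward(data: bytes, i: int) -> int:
    """Return the first character boundary at or after byte offset i.

    Assumes data is valid UTF-8, so at most three bytes are skipped.
    """
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i


def count_newlines(data: bytes) -> int:
    """Count newlines on raw bytes; 0x0A is never part of a multibyte character."""
    return data.count(b"\n")


# 'a' is 1 byte, 'é' 2 bytes, '€' 3 bytes, '😀' 4 bytes, 'z' 1 byte
data = "aé€😀z".encode("utf-8")
offset = sync_forward(data, 7)           # byte 7 is in the middle of the emoji
print(offset, data[offset:].decode("utf-8"))
```

So skipping ahead by bytes is cheap. Skipping ahead by exactly X characters is a different matter, since that still requires looking at each lead byte along the way.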