Hacker News new | ask | show | jobs
by matklad 2233 days ago
> It shouldn't be as cheap as just iterating over a sequence of 32-bit integers.

I wonder if there are any benchmarks about this? Specifically, it feels like in theory iterating utf8 could actually be faster if the data is mostly ascii, as that would require less memory bandwidth, and it seems like the computation is simple enough for memory to be the bottleneck (this is a wild guess, I have horrible intuition about speed of various hardware things). In this particular benchmark this reasoning doesn’t apply, as strings are short and should just fit in cache.

2 comments

If all you need to do is validate UTF-8, then yes, mostly ASCII enables some nice fast paths[1].

I'm not a UTF-8 decoding specialist, but if you need to traverse rune-by-rune via an API as general as `str::chars`, then you need to do some kind of work to convert your bytes into runes. Usually this involves some kind of branching.

But no, I haven't benchmarked it. Just intuition. A better researched response to your comment would benchmark, and would probably at least do some research on whether Daniel Lemire's work[2] would be applicable. (Or, in general, whether SIMD could be used to batch the UTF-8 decoding process.)

[1] - https://github.com/BurntSushi/bstr/blob/91edb3fb3e1ef347b30e...

[2] - https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u...

Did a quick benchmark: https://gist.github.com/matklad/ebc1acd2fab884f3ff9d96d4175f....

Seems like hypothesis is not wrong, but also is not interesting: the difference is pretty small, and it can be easily dwarfed by the big wins for utf32 if something like auto-vectorization or bounds-check elision kicks in. Also I am not entirely sure that the diff I observe is not due to something completely irrelevant, like one loop ending up on a lucky cacheline or what not.