|
|
|
|
|
by matklad
2233 days ago
|
|
> It shouldn't be as cheap as just iterating over a sequence of 32-bit integers. I wonder if there are any benchmarks about this? Specifically, it feels like in theory iterating utf8 could actually be faster if the data is mostly ascii, as that would require less memory bandwidth, and it seems like the computation is simple enough for memory to be the bottleneck (this is a wild guess, I have horrible intuition about speed of various hardware things). In this particular benchmark this reasoning doesn’t apply, as strings are short and should just fit in cache. |
|
I'm not a UTF-8 decoding specialist, but if you need to traverse rune-by-rune via an API as general as `str::chars`, then you need to do some kind of work to convert your bytes into runes. Usually this involves some kind of branching.
But no, I haven't benchmarked it. Just intuition. A better researched response to your comment would benchmark, and would probably at least do some research on whether Daniel Lemire's work[2] would be applicable. (Or, in general, whether SIMD could be used to batch the UTF-8 decoding process.)
[1] - https://github.com/BurntSushi/bstr/blob/91edb3fb3e1ef347b30e...
[2] - https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u...