| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by burntsushi 2239 days ago

It's cheap, but not _that_ cheap. It shouldn't be as cheap as just iterating over a sequence of 32-bit integers.

But yes, I did benchmark this, even after reusing allocations, and I can't tell a difference. The benchmark is fairly noisy.

I agree with your conclusion, especially after looking at the input[1]. The strings are so small that the overhead of caching the UTF-8 decoding is probably comparable to the cost of doing UTF-8 decoding.

[1] - https://github.com/christianscott/levenshtein-distance-bench...

1 comments

matklad 2239 days ago

> It shouldn't be as cheap as just iterating over a sequence of 32-bit integers.

I wonder if there are any benchmarks about this? Specifically, it feels like in theory iterating utf8 could actually be faster if the data is mostly ascii, as that would require less memory bandwidth, and it seems like the computation is simple enough for memory to be the bottleneck (this is a wild guess, I have horrible intuition about speed of various hardware things). In this particular benchmark this reasoning doesn’t apply, as strings are short and should just fit in cache.

link

burntsushi 2239 days ago

If all you need to do is validate UTF-8, then yes, mostly ASCII enables some nice fast paths[1].

I'm not a UTF-8 decoding specialist, but if you need to traverse rune-by-rune via an API as general as `str::chars`, then you need to do some kind of work to convert your bytes into runes. Usually this involves some kind of branching.

But no, I haven't benchmarked it. Just intuition. A better researched response to your comment would benchmark, and would probably at least do some research on whether Daniel Lemire's work[2] would be applicable. (Or, in general, whether SIMD could be used to batch the UTF-8 decoding process.)

[1] - https://github.com/BurntSushi/bstr/blob/91edb3fb3e1ef347b30e...

[2] - https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u...

link

matklad 2239 days ago

Did a quick benchmark: https://gist.github.com/matklad/ebc1acd2fab884f3ff9d96d4175f....

Seems like hypothesis is not wrong, but also is not interesting: the difference is pretty small, and it can be easily dwarfed by the big wins for utf32 if something like auto-vectorization or bounds-check elision kicks in. Also I am not entirely sure that the diff I observe is not due to something completely irrelevant, like one loop ending up on a lucky cacheline or what not.

link