| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by masklinn 2244 days ago

> Indeed. This is a pretty damning difference. The `target` string is being repeatedly UTF-8 decoded where as the same is not true in the Go version. The Go version even goes out of its way to do UTF-8 decoding exactly once for each of `source` and `target`, but then doesn't do the same for the Rust program.

I'm really not sure that's an issue, utf8 decoding is very, very cheap and it's iterating either way.

It would have to be benched, but I wouldn't be surprised if allocating the caches (at least one allocation per line of input) had way more overhead, especially given the inputs are so very short.

I'm not going to claim Rust's utf8 decoder is the fastest around, but it's very fast.

1 comments

burntsushi 2244 days ago

It's cheap, but not _that_ cheap. It shouldn't be as cheap as just iterating over a sequence of 32-bit integers.

But yes, I did benchmark this, even after reusing allocations, and I can't tell a difference. The benchmark is fairly noisy.

I agree with your conclusion, especially after looking at the input[1]. The strings are so small that the overhead of caching the UTF-8 decoding is probably comparable to the cost of doing UTF-8 decoding.

[1] - https://github.com/christianscott/levenshtein-distance-bench...

link

matklad 2243 days ago

> It shouldn't be as cheap as just iterating over a sequence of 32-bit integers.

I wonder if there are any benchmarks about this? Specifically, it feels like in theory iterating utf8 could actually be faster if the data is mostly ascii, as that would require less memory bandwidth, and it seems like the computation is simple enough for memory to be the bottleneck (this is a wild guess, I have horrible intuition about speed of various hardware things). In this particular benchmark this reasoning doesn’t apply, as strings are short and should just fit in cache.

link

burntsushi 2243 days ago

If all you need to do is validate UTF-8, then yes, mostly ASCII enables some nice fast paths[1].

I'm not a UTF-8 decoding specialist, but if you need to traverse rune-by-rune via an API as general as `str::chars`, then you need to do some kind of work to convert your bytes into runes. Usually this involves some kind of branching.

But no, I haven't benchmarked it. Just intuition. A better researched response to your comment would benchmark, and would probably at least do some research on whether Daniel Lemire's work[2] would be applicable. (Or, in general, whether SIMD could be used to batch the UTF-8 decoding process.)

[1] - https://github.com/BurntSushi/bstr/blob/91edb3fb3e1ef347b30e...

[2] - https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u...

link

matklad 2243 days ago

Did a quick benchmark: https://gist.github.com/matklad/ebc1acd2fab884f3ff9d96d4175f....

Seems like hypothesis is not wrong, but also is not interesting: the difference is pretty small, and it can be easily dwarfed by the big wins for utf32 if something like auto-vectorization or bounds-check elision kicks in. Also I am not entirely sure that the diff I observe is not due to something completely irrelevant, like one loop ending up on a lucky cacheline or what not.

link