Hacker News new | ask | show | jobs
by devit 2245 days ago
The main problem is that allocating a vector for each evaluation is completely wrong: instead, it needs to be allocated once by making the function a method on a struct containing a Vec; which makes the allocator moot.

The second problem is that at least the Rust code is decoding UTF-8 every iteration of the inner loop instead of decoding once and saving the result, or even better interning the characters and having versions of the inner loop for 32-bit chars and 8-bit and 16-bit interned indexes.

Furthermore the code rereads cache[j] instead of storing the previous value, and doesn't do anything to make sure that bound checks are elided in the inner loop (although perhaps the compiler can optimize that).

The code for computing the min seems to have been written mindlessly rather than putting serious thought towards whether to have branches or not and in what order (depending on an analysis of what the branch directions rates would be).

Implausible benchmark results are almost always an indicator of the incompetence of the person performing the benchmark.

3 comments

> The second problem is that at least the Rust code is decoding UTF-8 every iteration of the inner loop instead of decoding once and saving the result

Indeed. This is a pretty damning difference. The `target` string is being repeatedly UTF-8 decoded where as the same is not true in the Go version. The Go version even goes out of its way to do UTF-8 decoding exactly once for each of `source` and `target`, but then doesn't do the same for the Rust program.

> Implausible benchmark results are almost always an indicator of the incompetence of the person performing the benchmark.

Come on. We can do better than this. Please don't make it personal. We all have to learn things at some point.

> Indeed. This is a pretty damning difference. The `target` string is being repeatedly UTF-8 decoded where as the same is not true in the Go version. The Go version even goes out of its way to do UTF-8 decoding exactly once for each of `source` and `target`, but then doesn't do the same for the Rust program.

I'm really not sure that's an issue, utf8 decoding is very, very cheap and it's iterating either way.

It would have to be benched, but I wouldn't be surprised if allocating the caches (at least one allocation per line of input) had way more overhead, especially given the inputs are so very short.

I'm not going to claim Rust's utf8 decoder is the fastest around, but it's very fast.

It's cheap, but not _that_ cheap. It shouldn't be as cheap as just iterating over a sequence of 32-bit integers.

But yes, I did benchmark this, even after reusing allocations, and I can't tell a difference. The benchmark is fairly noisy.

I agree with your conclusion, especially after looking at the input[1]. The strings are so small that the overhead of caching the UTF-8 decoding is probably comparable to the cost of doing UTF-8 decoding.

[1] - https://github.com/christianscott/levenshtein-distance-bench...

> It shouldn't be as cheap as just iterating over a sequence of 32-bit integers.

I wonder if there are any benchmarks about this? Specifically, it feels like in theory iterating utf8 could actually be faster if the data is mostly ascii, as that would require less memory bandwidth, and it seems like the computation is simple enough for memory to be the bottleneck (this is a wild guess, I have horrible intuition about speed of various hardware things). In this particular benchmark this reasoning doesn’t apply, as strings are short and should just fit in cache.

If all you need to do is validate UTF-8, then yes, mostly ASCII enables some nice fast paths[1].

I'm not a UTF-8 decoding specialist, but if you need to traverse rune-by-rune via an API as general as `str::chars`, then you need to do some kind of work to convert your bytes into runes. Usually this involves some kind of branching.

But no, I haven't benchmarked it. Just intuition. A better researched response to your comment would benchmark, and would probably at least do some research on whether Daniel Lemire's work[2] would be applicable. (Or, in general, whether SIMD could be used to batch the UTF-8 decoding process.)

[1] - https://github.com/BurntSushi/bstr/blob/91edb3fb3e1ef347b30e...

[2] - https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u...

Did a quick benchmark: https://gist.github.com/matklad/ebc1acd2fab884f3ff9d96d4175f....

Seems like hypothesis is not wrong, but also is not interesting: the difference is pretty small, and it can be easily dwarfed by the big wins for utf32 if something like auto-vectorization or bounds-check elision kicks in. Also I am not entirely sure that the diff I observe is not due to something completely irrelevant, like one loop ending up on a lucky cacheline or what not.

> The second problem is that at least the Rust code is decoding UTF-8 every iteration of the inner loop instead of decoding once and saving the result, or even better interning the characters and having versions of the inner loop for 32-bit chars and 8-bit and 16-bit interned indexes.

I tried this. Pulling the .chars() call out of the loop & collecting into a Vec made the performance even worse – the following balloons the runtime from ~2.7s to ~5s:

    let target_chars: Vec<char> = target.chars().collect();
    for (i, source_char) in source.chars().enumerate() {
        let mut next_dist = i + 1;

        for (j, target_char) in target_chars.iter().enumerate()  {
> written mindlessly > incompetence of the person

No challenge there :P I am operating under the assumption that I don't need to understand how compilers work to get good performance from rust (where good is "similar enough to an equivalent go program")

Indeed, it looks like doing the UTF-8 decoding up-front is exacerbating the performance difference in the allocator.

I think this is where the GP's first suggestion comes into play. If one were writing this code _and_ cared about performance, then you'd usually find a way to reuse allocations. I submitted a PR to demonstrate this: https://github.com/christianscott/levenshtein-distance-bench...

Yeah, I can definitely see how that would be a more performant approach.

I suppose I wasn't so interested in figuring out how to make this algorithm as fast as possible as much I was interested in diving into why this particular implementation was slower.

I'm not totally convinced that this difference is down to the string being parsed over and over, though

> doing the UTF-8 decoding up-front is exacerbating the performance difference in the allocator

This seems to suggest that allocation might be dominating here. WDYT? Either way, I've added a disclaimer to the post.

> Implausible benchmark results are almost always an indicator of the incompetence of the person performing the benchmark.

An ad hominem attack surely isn't needed.

And that's really too bad because I think the rest of the comment is spot-on as the fundamental issue is that Rust simply delegates to the system allocator, and MacOS's allocator is pretty slow.

As a result, while Rust allows very explicitly and relatively easily removing allocations (compared to C or C++), getting the most performances out of your program also requires doing so, unless you use a non-system allocator with better support for the workload.

But you can use a custom allocator in Rust: https://doc.rust-lang.org/1.9.0/book/custom-allocators.html

You don’t have that option in Go

> But you can use a custom allocator in Rust: https://doc.rust-lang.org/1.9.0/book/custom-allocators.html

That is true — and was demonstrated by the article as its "fix" was to use jemalloc, but even a custom allocator will usually be less performant than a high-performance GC there, because the GC's allocator has more insight into the requirements and workloads.

It might be possible to give that insight to a custom allocator by using the allocator's custom APIs, but this requires a deeper integration between the program and the allocator.

(those are very old docs, you want to link to https://doc.rust-lang.org/stable/std/alloc/trait.GlobalAlloc...)
> You don’t have that option in Go

Sure you do. You can even build one from nothing but mmap in pure Go. It just won't be part of the garbage collection, so you get malloc/free or arenas etc, just like in C/Rust/whatever.