Hacker News new | ask | show | jobs
by molf 2245 days ago
The Rust version uses `target.chars().count()` to initialise the cache, while the Go version counts up to `len(target)`. These are not equivalent: the Rust version counts Unicode code points, the Go version counts bytes.

I am confused by the implementations, although I have not spent any time testing them. Both versions contain a mix of code that counts bytes (`.len()` and `len(...)`) and Unicode code points (`chars()` and `[]rune(...)`). My guess is that the implementation might not work correctly for certain non-ASCII strings, but I have not verified this.

Of course, if only ASCII strings are valid as input for this implementation then both versions will be a lot faster if they exclusively operate on bytes instead.

5 comments

Yep.

Here a Go playground example showing that the result is indeed wrong:

https://play.golang.org/p/vmctMFUevPc

It should output 3 but outputs 5 because each ö is two bytes, len("föö") = 5.

I would suggest using "range" to iterate over the unicode characters.

If you diff föö and f though, it correctly gives an edit distance of 2.

The code is weird because someone knew enough to convert the strings to slices of runes but not enough to use the rune slices consistently. :-/

Not to mention Rune slices are insufficient for things like Flag emoji and Family emoji, which is going to be a ton of separate runes put together. The latter of which, apparently deletes one family member at a time when you hit "backspace".
Oh fab, just when I thought I had a fairly solid understanding of how to handle Unicode strings I learn something else that increases the complexity.

I have nothing but respect and gratitude for people that write good unicode handling libraries, but even then the end developer has to learn a lot just to be aware of what to look out for when handling strings.

Somewhere on github I think, somebody has posted a file with evil Unicode strings.

In general, unicode requires you think differently about strings depending on context. Here's my rule of thumb.

1. If you are transporting a unicode string, reading/writing over the network or to a file, think in terms of UTF-8 bytes. Do not attempt to splice the string, treat it as an atomic unit.

2. If you are parsing a string, think in terms of code points (runes in Go, chars in Rust). A good example would be the Servo CSS parser. [1]

3. If you're comparing/searching/inspecting/sorting a string in code, segment by grapheme clusters and normalize, then do what you came to do. [2]

4. If you're displaying a string, think in terms of pixels. Do not attempt to limit a string by length in "characters" (nee grapheme clusters in the unicode world) but rather measure by what the renderer does with the string. Each character can be a thoroughly arbitrary width and height.

5. If you're building a WYSIWYG editor, there's more to it than I even know myself, but I suggest reading into what Xi did. It's going to be some combination of everything above. [3]

[1] https://github.com/servo/rust-cssparser/blob/master/src/toke...

[2] https://github.com/unicode-rs/unicode-segmentation

[3] https://github.com/xi-editor/xi-editor

> 2. If you are parsing a string, think in terms of code points (runes in Go, chars in Rust). A good example would be the Servo CSS parser. [1]

If all your syntactically meaningful characters are in ASCII you can also use UTF-8 bytes in your parser.

Even if they aren't, no UTF-8 encoding of a character is a substring of the encoding of any other character(s).

Looks like a small bug in the go code, corrected here. Original author should have used rune throughout. https://play.golang.org/p/mGZZMFtMgHH
Yes, especially because that changes it from constant time (strings know their length in bytes) to linear time (counting chars means a loop through the text.)

I was a bit suspicious of the conclusion, but didn’t dig in myself. I imagine this would be a much larger source of difference.

huh, yeah if I switch the rust version to target.len() the execution time drops by more than 10%

edit: and if I switch to source.bytes().enumerate() it drops by 20% more

Go's version and the Rust version differ in yet more subtle ways. It appears that Go's "rune" type is a Code Point, but Rusts's "char" type is a Unicode Scalar Value, a subset of Code Point that excludes surrogate pairs. Both versions will not work with complex Unicode input unless you perform both segmentation by Grapheme Cluster [1] and utilize a consistent Normalization [2] when comparing clusters.

Unicode is hard, fams, and it's rare that anything that looks easy is actually what you want.

[1] https://unicode.org/reports/tr29/

[2] http://unicode.org/reports/tr15/

> Both versions will not work with complex Unicode input unless you perform both segmentation by Grapheme Cluster [1] and utilize a consistent Normalization [2] when comparing clusters

Doing this is quite easy from Rust.

There's no standard library API in Rust for either grapheme cluster segmentation or for normalization. You'd need a third-party crate, at which point it's less "easy in Rust" and more "someone did the hard work for you" because, its really not easy anywhere haha.
No, and there probably never be a libstd implementation of it, but there is a single crate that everybody uses: unicode-segmentation [0]

> You'd need a third-party crate, at which point it's less "easy in Rust" and more "someone did the hard work for you"

"Someone did the work for you" is true of all code that you did not write yourself, independently of whether that code is easy or hard to write, or whether it is in the standard library or not.

unicode-segmentation is pretty much the only library of its kind in Rust, is super easy to discover (google "Rust grapheme cluster", "Rust unicode segmentation", etc.), and using it is as easy as just typing "cargo add unicode-segmentation".

The library is maintained by a Rust core team member, a Rust standard library team member, is used by servo and firefox, and is the only unicode segmentation library that people use.

Since many programs don't need to do any kind of unicode segmentation, making it part of the standard library sounds like a bad idea. In particular, given that unicode is a moving standard, it would mean that people stuck on old Rust toolchains (e.g. LTS linux distros) cannot create binaries that do proper unicode segmentation, which does not make sense.

The underlying problem is that many programmers do not know that they need to do unicode segmentation in the first place. Moving this into the standard library does not fix that problem either.

[0]: https://crates.io/crates/unicode-segmentation

> Since many programs don't need to do any kind of unicode segmentation, making it part of the standard library sounds like a bad idea. In particular, given that unicode is a moving standard, it would mean that people stuck on old Rust toolchains (e.g. LTS linux distros) cannot create binaries that do proper unicode segmentation, which does not make sense.

That has nothing to do with it. You could still have a library that has the very latest Unicode standard support for those that need the very latest, and keep updating the stdlib one.

It does not make sense either to expect someone to use bleeding edge libraries from cargo yet use an old rustc compiler. They can easily update it if needed.

> It does not make sense either to expect someone to use bleeding edge libraries from cargo yet use an old rustc compiler.

Of course it does. Many software users are stuck on multiple-year-old toolchains for various reasons, yet these systems still need to be able to handle unicode properly.

> They can easily update it if needed.

No, they cannot. Many users are stuck in older windows versions, linux versions, LTS linux versions, etc. because of their organization or their clients requirements.

Telling a client that you can't develop an app for them because their 2 year old Ubuntu is too old is often not a very successful business model.

> and keep updating the stdlib one.

These updates would only apply to newer Rust toolchains, that many users cannot use. Unless you are suggesting the release of patch versions for the soon to be 100 old Rust toolchains in existence every time the unicode standard is updated.

This is too much trouble and work for little gain, given that one can still use a Rust 1.0 compiler to compile the latest version of the unicode-segmentation crate without problems.

Nice catch, thanks for pointing this out. I've updated the cache initialization to use `len(targetChars)` rather than `len(target)`:

    cache := make([]int, len(targetChars)+1)
    for i := 0; i < len(targetChars)+1; i++ {
        cache[i] = i
    }
AFAIK this makes them equivalent (fingers crossed). It seems to not have made much of a difference (-0.03s)
They might be equivalent (perhaps apart from other issues pointed out in other comments), but both implementations are still either

1) incorrect if UTF-8-strings are supposed to be valid input, or

2) very inefficient if only ASCII-strings are supposed to be valid input.