Hacker News new | ask | show | jobs
by eis 2233 days ago
Yep.

Here a Go playground example showing that the result is indeed wrong:

https://play.golang.org/p/vmctMFUevPc

It should output 3 but outputs 5 because each ö is two bytes, len("föö") = 5.

I would suggest using "range" to iterate over the unicode characters.

2 comments

If you diff föö and f though, it correctly gives an edit distance of 2.

The code is weird because someone knew enough to convert the strings to slices of runes but not enough to use the rune slices consistently. :-/

Not to mention Rune slices are insufficient for things like Flag emoji and Family emoji, which is going to be a ton of separate runes put together. The latter of which, apparently deletes one family member at a time when you hit "backspace".
Oh fab, just when I thought I had a fairly solid understanding of how to handle Unicode strings I learn something else that increases the complexity.

I have nothing but respect and gratitude for people that write good unicode handling libraries, but even then the end developer has to learn a lot just to be aware of what to look out for when handling strings.

Somewhere on github I think, somebody has posted a file with evil Unicode strings.

In general, unicode requires you think differently about strings depending on context. Here's my rule of thumb.

1. If you are transporting a unicode string, reading/writing over the network or to a file, think in terms of UTF-8 bytes. Do not attempt to splice the string, treat it as an atomic unit.

2. If you are parsing a string, think in terms of code points (runes in Go, chars in Rust). A good example would be the Servo CSS parser. [1]

3. If you're comparing/searching/inspecting/sorting a string in code, segment by grapheme clusters and normalize, then do what you came to do. [2]

4. If you're displaying a string, think in terms of pixels. Do not attempt to limit a string by length in "characters" (nee grapheme clusters in the unicode world) but rather measure by what the renderer does with the string. Each character can be a thoroughly arbitrary width and height.

5. If you're building a WYSIWYG editor, there's more to it than I even know myself, but I suggest reading into what Xi did. It's going to be some combination of everything above. [3]

[1] https://github.com/servo/rust-cssparser/blob/master/src/toke...

[2] https://github.com/unicode-rs/unicode-segmentation

[3] https://github.com/xi-editor/xi-editor

> 2. If you are parsing a string, think in terms of code points (runes in Go, chars in Rust). A good example would be the Servo CSS parser. [1]

If all your syntactically meaningful characters are in ASCII you can also use UTF-8 bytes in your parser.

Even if they aren't, no UTF-8 encoding of a character is a substring of the encoding of any other character(s).

Looks like a small bug in the go code, corrected here. Original author should have used rune throughout. https://play.golang.org/p/mGZZMFtMgHH