| HN Mirror


by eis 2280 days ago

Yep.

Here is a Go playground example showing that the result is indeed wrong:

https://play.golang.org/p/vmctMFUevPc

It should output 3 but outputs 5, because each ö is two bytes in UTF-8 and len() counts bytes: len("föö") = 5.

I would suggest using "range" to iterate over the Unicode code points (runes) instead of the bytes.
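
To make the byte/rune difference concrete, here is a minimal sketch (not the code from the playground link; `countRunes` is a made-up helper name) comparing `len` with a `range` loop:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunes counts code points by ranging over the string.
// A range loop over a string decodes UTF-8 one rune at a time.
// Hypothetical helper, for illustration only.
func countRunes(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	s := "f\u00f6\u00f6" // "föö" with precomposed ö (U+00F6)

	fmt.Println(len(s))                    // 5: length in bytes
	fmt.Println(countRunes(s))             // 3: code points, via range
	fmt.Println(utf8.RuneCountInString(s)) // 3: same count from the standard library
}
```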

2 comments

earthboundkid 2280 days ago

If you diff föö and f though, it correctly gives an edit distance of 2.

The code is weird because someone knew enough to convert the strings to slices of runes but not enough to use the rune slices consistently. :-/


arcticbull 2280 days ago

Not to mention that rune slices are insufficient for things like flag emoji and family emoji, each of which is several separate runes put together. The latter apparently deletes one family member at a time when you hit "backspace".
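
A rough illustration of that point, using only the standard library. The two constants are one example flag and one example family sequence that I picked; `runeCount` is a made-up helper name:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	// One flag on screen: two regional indicator symbols (D and E).
	flag = "\U0001F1E9\U0001F1EA"
	// One family on screen: man, ZWJ, woman, ZWJ, girl.
	family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
)

// runeCount reports the number of code points in s.
// Hypothetical helper, for illustration only.
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	// flag:   8 bytes, 2 runes
	// family: 18 bytes, 5 runes
	for _, s := range []string{flag, family} {
		fmt.Printf("%d bytes, %d runes\n", len(s), runeCount(s))
	}
}
```

Each of those strings is what a user would call a single character, so counting runes still overcounts. Getting the count a user expects needs grapheme cluster segmentation, which the Go standard library does not provide.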


bigbizisverywyz 2279 days ago

Oh fab, just when I thought I had a fairly solid understanding of how to handle Unicode strings I learn something else that increases the complexity.

I have nothing but respect and gratitude for people who write good Unicode handling libraries, but even then the end developer has to learn a lot just to be aware of what to look out for when handling strings.

Somewhere on GitHub, I think, somebody has posted a file with evil Unicode strings.


arcticbull 2279 days ago

In general, Unicode requires you to think differently about strings depending on context. Here's my rule of thumb.

1. If you are transporting a Unicode string, reading/writing over the network or to a file, think in terms of UTF-8 bytes. Do not attempt to splice the string; treat it as an atomic unit.

2. If you are parsing a string, think in terms of code points (runes in Go, chars in Rust). A good example would be the Servo CSS parser. [1]

3. If you're comparing/searching/inspecting/sorting a string in code, segment by grapheme clusters and normalize, then do what you came to do. [2]

4. If you're displaying a string, think in terms of pixels. Do not attempt to limit a string by length in "characters" (grapheme clusters, in Unicode terms) but rather measure by what the renderer does with the string. Each character can be a thoroughly arbitrary width and height.

5. If you're building a WYSIWYG editor, there's more to it than I even know myself, but I suggest reading into what Xi did. It's going to be some combination of everything above. [3]

[1] https://github.com/servo/rust-cssparser/blob/master/src/toke...

[2] https://github.com/unicode-rs/unicode-segmentation

[3] https://github.com/xi-editor/xi-editor
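
A small sketch of why rules 1 and 2 differ, in Go since that is what the thread started with. `truncateRunes` is a made-up helper, not something from the linked projects, and it only respects code point boundaries, not the grapheme clusters from rule 3:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes returns at most n code points of s, cutting only on
// rune boundaries. Hypothetical helper; it can still split a grapheme cluster.
func truncateRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	count := 0
	for i := range s { // i is the byte offset where each rune starts
		if count == n {
			return s[:i]
		}
		count++
	}
	return s
}

func main() {
	s := "f\u00f6\u00f6" // "föö"

	// Slicing at an arbitrary byte offset cuts the first ö in half.
	fmt.Println(utf8.ValidString(s[:2])) // false

	// Cutting on a rune boundary keeps the string valid UTF-8.
	fmt.Println(utf8.ValidString(truncateRunes(s, 2))) // true
}
```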


account42 2277 days ago

> 2. If you are parsing a string, think in terms of code points (runes in Go, chars in Rust). A good example would be the Servo CSS parser. [1]

If all your syntactically meaningful characters are in ASCII you can also use UTF-8 bytes in your parser.

Even if they aren't, no UTF-8 encoding of a character is a substring of the encoding of any other character(s), so a byte-wise search for a valid UTF-8 sequence never matches in the middle of another character.
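
For example, a sketch of a byte-level splitter (`splitOnASCII` is a made-up name). It assumes the separator is an ASCII byte below 0x80 and the input is valid UTF-8; every byte of a multi-byte sequence is 0x80 or above, so the comparison cannot match inside one:

```go
package main

import "fmt"

// splitOnASCII splits s at every occurrence of the ASCII byte sep,
// scanning raw bytes without decoding any runes.
// Hypothetical helper; assumes sep < 0x80 and s is valid UTF-8.
func splitOnASCII(s string, sep byte) []string {
	var parts []string
	start := 0
	for i := 0; i < len(s); i++ {
		if s[i] == sep {
			parts = append(parts, s[start:i])
			start = i + 1
		}
	}
	return append(parts, s[start:])
}

func main() {
	// "föö", "naïve", and a two-rune flag, separated by commas.
	in := "f\u00f6\u00f6,na\u00efve,\U0001F1E9\U0001F1EA"
	for _, p := range splitOnASCII(in, ',') {
		fmt.Println(p)
	}
}
```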


savaki 2280 days ago

Looks like a small bug in the Go code, corrected here. The original author should have used runes throughout. https://play.golang.org/p/mGZZMFtMgHH
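
For reference, a minimal sketch of what "runes throughout" can look like. This is a generic two-row Levenshtein implementation written for illustration, not the code behind either playground link; `levenshtein` and `min3` are my own names:

```go
package main

import "fmt"

// levenshtein returns the edit distance between a and b, measured in
// code points. Both strings are converted to []rune once, and every
// length and index below refers to those rune slices, never to bytes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	fmt.Println(levenshtein("f\u00f6\u00f6", ""))  // 3
	fmt.Println(levenshtein("f\u00f6\u00f6", "f")) // 2
}
```

As noted upthread, this still treats a flag or family emoji as several edits, because it compares code points rather than grapheme clusters.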
