by arcticbull 2235 days ago
Not to mention that rune slices are insufficient for things like flag emoji and family emoji, each of which is a sequence of several separate runes put together. The family emoji apparently loses one family member at a time when you hit "backspace".
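
To make that concrete, here is a minimal sketch in Python, whose `str` is indexed by code point much like a Go rune slice. The `describe` and `drop_last_member` helpers are hypothetical names made up for this illustration, and the backspace behavior is only an approximation of what an editor might do, not how any particular one implements it.

```python
# Code points are written as escapes so the composition is visible.
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# man, woman, girl, boy joined by ZWJ render as one family emoji
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])
# two regional indicator symbols (U, S) render as one flag
flag = "\U0001F1FA\U0001F1F8"


def describe(s):
    """Return (code point count, UTF-8 byte count) for s."""
    return len(s), len(s.encode("utf-8"))


def drop_last_member(seq):
    """Hypothetical backspace: remove the last ZWJ-joined element."""
    head, _, _ = seq.rpartition(ZWJ)
    return head


print(describe(family))                    # (7, 25)
print(describe(flag))                      # (2, 8)
print(describe(drop_last_member(family)))  # (5, 18)
```

One visible "character" on screen is 7 code points in the first case and 2 in the second, which is why indexing by rune is not enough.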

Oh fab, just when I thought I had a fairly solid understanding of how to handle Unicode strings, I learn something else that increases the complexity.

I have nothing but respect and gratitude for people who write good Unicode-handling libraries, but even then the end developer has to learn a lot just to be aware of what to look out for when handling strings.

Somewhere on GitHub, I think, somebody has posted a file of evil Unicode strings.

In general, Unicode requires you to think differently about strings depending on context. Here's my rule of thumb.

1. If you are transporting a Unicode string (reading or writing it over the network or to a file), think in terms of UTF-8 bytes. Do not attempt to slice or splice the string; treat it as an atomic unit.

2. If you are parsing a string, think in terms of code points (runes in Go, chars in Rust). A good example would be the Servo CSS parser. [1]

3. If you're comparing/searching/inspecting/sorting a string in code, segment by grapheme clusters and normalize, then do what you came to do. [2]

4. If you're displaying a string, think in terms of pixels. Do not attempt to limit a string by length in "characters" (née grapheme clusters in the Unicode world); instead, measure what the renderer does with the string. Each character can have a thoroughly arbitrary width and height.

5. If you're building a WYSIWYG editor, there's more to it than I even know myself, but I suggest reading up on what Xi did. It's going to be some combination of everything above. [3]

[1] https://github.com/servo/rust-cssparser/blob/master/src/toke...

[2] https://github.com/unicode-rs/unicode-segmentation

[3] https://github.com/xi-editor/xi-editor
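
Rules 1 to 3 can be sketched roughly like this in Python, using only the standard library. The helper names `cut_is_valid` and `same_text` are made up for the example. Grapheme cluster segmentation is not in Python's standard library, so this only shows the normalization half of rule 3; a real implementation would also need a segmentation library along the lines of [2].

```python
import unicodedata

text = "cafe\u0301"  # "café" spelled as e + COMBINING ACUTE ACCENT

# Rule 1: for transport, encode once and move the bytes as one unit.
wire = text.encode("utf-8")


def cut_is_valid(data, n):
    """Return True if the first n bytes still decode as UTF-8."""
    try:
        data[:n].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(cut_is_valid(wire, 5))  # False: cut lands inside the 2-byte accent
print(cut_is_valid(wire, 4))  # True, but the accent is silently dropped

# Rule 2: a parser walks code points; there are 5 here, not 4.
print(len(text))  # 5


# Rule 3: normalize both sides before comparing.
def same_text(a, b):
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "caf\u00e9"  # "café" with a single precomposed é
print(text == precomposed)           # False
print(same_text(text, precomposed))  # True
```

Note that the 4-byte cut is valid UTF-8 and still changes the meaning of the string, which is the reason for treating the payload as atomic.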

> 2. If you are parsing a string, think in terms of code points (runes in Go, chars in Rust). A good example would be the Servo CSS parser. [1]

If all your syntactically meaningful characters are in ASCII, you can also use UTF-8 bytes in your parser.

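Even if they aren't, no UTF-8 encoding of a character is a substring of the encoding of any other character(s), so in valid UTF-8 a byte-level search for an encoded character can only match at a real character boundary.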
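
A small Python sketch of what that buys you, assuming the input is valid UTF-8 (the property does not hold for arbitrary bytes). `split_fields` and `find_char` are hypothetical helpers written for this illustration.

```python
def split_fields(data, delimiter):
    """Split UTF-8 bytes on a delimiter without decoding first."""
    return [part.decode("utf-8") for part in data.split(delimiter.encode("utf-8"))]


def find_char(data, ch):
    """Return the byte offset of ch's UTF-8 encoding in data, or -1."""
    return data.find(ch.encode("utf-8"))


# ASCII syntax: the byte 0x2C can only ever be a real comma.
line = "naïve,日本語,crème".encode("utf-8")
print(split_fields(line, ","))  # ['naïve', '日本語', 'crème']

# Non-ASCII delimiter: the match still lands on character boundaries,
# so every piece decodes cleanly.
arrows = "a→b→日本".encode("utf-8")
print(split_fields(arrows, "→"))  # ['a', 'b', '日本']

# The byte offset of a match is always the start of a character.
data = "日本語".encode("utf-8")
offset = find_char(data, "本")
print(offset)                        # 3
print(data[:offset].decode("utf-8"))  # 日
```

The search itself never has to decode anything; decoding happens only on the pieces, and only because the example wants strings back.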