

by josephg 1600 days ago

Handling Unicode can be fine, depending on what you're doing. The hard parts are:

- Counting, rendering and collapsing grapheme clusters (like the flag emoji)

- Converting between legacy encodings (Shift JIS, KOI8, etc.) and UTF-8 / UTF-16

- Canonicalization
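To make the first and last bullets concrete, here's a rough sketch using only Rust's std. The `counts` helper is a name I made up for illustration. It counts a string three ways, and none of them gives 1 for a flag emoji, which is what a user would call "one character". The last line of `main` shows why canonicalization matters: two spellings of "é" compare unequal as raw strings.

```rust
// Count the same string three ways using only the standard library.
// `counts` is a hypothetical helper, not a std API.
fn counts(s: &str) -> (usize, usize, usize) {
    let bytes = s.len(); // UTF-8 bytes
    let utf16_units = s.encode_utf16().count(); // UTF-16 code units
    let scalars = s.chars().count(); // Unicode scalar values
    (bytes, utf16_units, scalars)
}

fn main() {
    // Two regional indicator symbols (U+1F1FA, U+1F1F8), rendered as one flag.
    let flag = "\u{1F1FA}\u{1F1F8}";
    let (bytes, units, scalars) = counts(flag);
    println!("bytes={bytes} utf16={units} scalars={scalars}");

    // Precomposed U+00E9 vs. 'e' followed by combining acute U+0301:
    // same text to a reader, different code points without normalization.
    println!("equal without normalization: {}", "\u{E9}" == "e\u{301}");
}
```

Getting the answer 1 for the flag, or treating the two "é" spellings as equal, needs grapheme segmentation and normalization tables, which is where the extra machinery comes in.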

If all you need is to deal with UTF-8 byte buffers, you don't need all that stuff, and your code can stay simple, small and fast.

IIRC the Rust standard library doesn't bother supporting any of the hard parts of Unicode. The only real Unicode support in std is UTF-8 validation for strings. All the complex aspects of Unicode are delegated to third-party crates.
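For reference, the validation step looks roughly like this. It's a minimal sketch: `describe` is a hypothetical helper of mine, while `std::str::from_utf8` and `Utf8Error::valid_up_to` are the actual std calls.

```rust
use std::str;

// Report whether a byte buffer is valid UTF-8.
// `describe` is a hypothetical helper, not a std API.
fn describe(buf: &[u8]) -> String {
    match str::from_utf8(buf) {
        // On success we get a &str borrowing the same bytes, with no copy.
        Ok(s) => format!("valid: {s}"),
        // On failure the error says how many leading bytes were valid.
        Err(e) => format!("invalid after {} valid bytes", e.valid_up_to()),
    }
}

fn main() {
    println!("{}", describe(b"hi"));
    // 0xFF can never appear in well-formed UTF-8.
    println!("{}", describe(&[0x68, 0x69, 0xFF]));
}
```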

By contrast, Node.js (and web browsers) do all of this. But they implement it the same way you're suggesting: they simply call out to libicu.

1 comment

tialaramex 1600 days ago

> The only real unicode support in std is utf8 validation for strings.

Rust's core library gives char methods such as is_numeric, which asks whether this Unicode code point is in one of Unicode's numeric classes, such as the letter-like numerics and the various digits. (Rust also provides char with is_ascii_digit and is_ascii_hexdigit, if that's all you actually cared about.)
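A quick sketch of the difference, if it helps. The `classify` wrapper is just something I made up for the comparison; the three char methods are the real ones.

```rust
// Compare the Unicode-aware check with the ASCII-only ones.
// `classify` is a hypothetical helper; the char methods are from core.
fn classify(c: char) -> (bool, bool, bool) {
    (c.is_numeric(), c.is_ascii_digit(), c.is_ascii_hexdigit())
}

fn main() {
    // '7' ASCII digit, U+0663 Arabic-Indic digit three, U+00BE fraction,
    // 'a' hex digit only, U+4E09 CJK "three" (a letter to Unicode, not a number).
    for c in ['7', '\u{0663}', '\u{00BE}', 'a', '\u{4E09}'] {
        let (numeric, ascii_digit, ascii_hex) = classify(c);
        println!(
            "U+{:04X}: is_numeric={numeric} is_ascii_digit={ascii_digit} is_ascii_hexdigit={ascii_hex}",
            c as u32
        );
    }
}
```

Answering is_numeric for non-ASCII input needs a lookup table derived from the Unicode data, which is the point: that table ships with the library.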

So yes, the Rust standard library is carrying around Unicode's character class tables, among other things. Of course, Rust's library isn't linked into your program as one vast binary, so if you never use these features, your binary doesn't get that code.