by imron 3780 days ago
Both utf8 and utf16 can encode a single code point as a sequence of several code units. If you split a string at an arbitrary point you risk splitting it in the middle of such a sequence.

This will be very common in utf8 that contains non-ascii characters, and very rare with utf16 (only happens with characters outside the BMP).

Neither is something you want in your code, unless you think it's a good idea to corrupt your users' data.

Edit: It's not too difficult to handle these cases and make sure you only split at valid positions, but you do need to be careful. There are a number of edge cases you might not think through, or even encounter, unless you have the right sort of data to test with, which leads to lots of faulty implementations. e.g. for years MySQL's utf8 charset couldn't store characters outside the BMP.
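To make "only split at valid positions" concrete, here is a minimal Python sketch for utf8. The helper name `safe_split_utf8` is made up for illustration, and it assumes the input is already valid utf8. It only protects code point boundaries; it does nothing about combining sequences.

```python
def safe_split_utf8(data: bytes, limit: int):
    """Split valid UTF-8 bytes at or before `limit` without cutting a code point.

    Returns a (head, tail) pair. Hypothetical helper, for illustration only.
    """
    cut = max(0, min(limit, len(data)))
    # Continuation bytes have the form 0b10xxxxxx. If the cut lands on one,
    # back up until it sits on the first byte of a code point.
    while 0 < cut < len(data) and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut], data[cut:]


raw = "naïve 𪚥".encode("utf-8")    # 'ï' takes 2 bytes, '𪚥' takes 4
head, tail = safe_split_utf8(raw, 3)  # byte 3 is the second byte of 'ï'
print(head.decode("utf-8"))           # na
print(tail.decode("utf-8"))           # ïve 𪚥
```

A naive `raw[:3].decode("utf-8")` raises `UnicodeDecodeError` here, because the slice ends in the middle of 'ï'.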


My parent was speaking about indexing at the code points level, not at the encoding (byte / character) level.

I do know that Unicode has combining code points (confusingly called combining characters) and nasty things like rtl switching code points. I guess it's turtles all the way down.
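A small Python illustration of the combining case, using only the standard `unicodedata` module. Python strings index by code point, so this shows that even code point indexing can cut a user-visible character in two.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))         # 2 code points, rendered as one character
print(len(composed))           # 1 code point (U+00E9)
print(decomposed[:1] == "e")   # True: slicing by code point dropped the accent
```

Normalizing to NFC helps in this case, but not every combining sequence has a precomposed form, so it is not a general fix.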

> My parent was speaking about indexing at the code points level, not at the encoding (byte / character) level.

You need UTF-32 for (random) indexing of code points. UTF-16 has 16-bit code units. Code points outside the BMP take 32 bits in UTF-16, encoded as a surrogate pair.

So it's the same trade-off as with UTF-8. Thus no reason not to just simply use UTF-8 in the first place and take advantage of the memory savings.
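For a rough feel of the memory trade-off, the following Python sketch compares encoded sizes. The helper `encoded_sizes` and the sample strings are made up for illustration; the little-endian codec names are used so no byte order mark is counted.

```python
def encoded_sizes(s: str) -> dict:
    """Return the encoded length in bytes of `s` for each encoding form."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("hello"))   # utf-8: 5,  utf-16-le: 10, utf-32-le: 20
print(encoded_sizes("日本語"))   # utf-8: 9,  utf-16-le: 6,  utf-32-le: 12
print(encoded_sizes("𪚥"))      # utf-8: 4,  utf-16-le: 4,  utf-32-le: 4
```

The savings depend on the text: ASCII-heavy data is half the size in utf8, while BMP CJK text is smaller in utf16.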

Again, my original parent's statement was not about encoding or memory savings. The statement was that it was a bad idea to index into an (abstract) unicode string (of unicode code points -- not compositions thereof whatsoever).

I didn't question that, but hoped to get some inspiration for sane usage of unicode handling (which I'm not sure is humanly possible, except for treating it as a rather black box and making no promises).

Your original parent was all about encodings, and mentioned it was a bad idea to arbitrarily index into utf8 strings (no mention of abstract strings of unicode codepoints).

> languages such as Rust gain efficiency by working with unmodified UTF-8. All you lose is constant-time arbitrary indexing

So it's saying Rust mostly benefits from using utf8, but in doing so, it loses the ability to arbitrarily index a character in a string (in constant time).
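To show why the indexing stops being constant time, here is a Python sketch that finds the n-th code point in raw utf8 bytes. This is not how Rust implements anything; `utf8_seq_len` and `code_point_at` are hypothetical helpers, and the input is assumed to be valid utf8.

```python
def utf8_seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with byte `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def code_point_at(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) by walking from the start: O(n)."""
    if n < 0:
        raise IndexError("negative index")
    i = 0
    for _ in range(n):
        if i >= len(data):
            raise IndexError("index out of range")
        i += utf8_seq_len(data[i])  # skip one whole code point
    if i >= len(data):
        raise IndexError("index out of range")
    return data[i:i + utf8_seq_len(data[i])].decode("utf-8")


raw = "naïve 𪚥".encode("utf-8")
print(code_point_at(raw, 2))  # ï
print(code_point_at(raw, 6))  # 𪚥
```

Byte offsets are still O(1) to index, which is why APIs built on utf8 tend to hand out byte positions rather than code point positions.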

If it were abstract strings of unicode codepoints then there would be no problem, except you'd then be using 32 bits per codepoint.

Actually, they are not combining code points. Take for example the character 𪚥 (4 dragons).

The codepoint is U+2A6A5, but in UTF16 it requires combining 2 utf16 characters (\uD869 and \uDEA5) in order to reference it.

The codepoint however is still exactly the same (U+2A6A5).
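The arithmetic behind that pair can be sketched in a few lines of Python. The function name `to_surrogate_pair` is made up for illustration; the constants are the standard UTF-16 ones.

```python
def to_surrogate_pair(cp: int):
    """Split a code point above the BMP into its two UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000                # 20 bits remain after removing the BMP
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x2A6A5)
print(f"{high:04X} {low:04X}")           # D869 DEA5
```

Python's own encoder agrees: `"\U0002A6A5".encode("utf-16-be")` yields the bytes D8 69 DE A5.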

> The codepoint is U+2A6A5, but in UTF16 it requires combining 2 utf16 characters (\uD869 and \uDEA5) in order to reference it.

No, you mean two UTF-16 code units. A character is one or more code points.