| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by jstimpfle 3782 days ago
	My parent was speaking about indexing at the code points level, not at the encoding (byte / character) level. I do know that Unicode has combining code points (confusingly called combining characters) and nasty things like rtl switching code points. I guess it's turtles all the way down.

2 comments

vardump 3782 days ago

> My parent was speaking about indexing at the code points level, not at the encoding (byte / character) level.

You need UTF-32 for (random) indexing of code points. UTF-16 has 16-bit code units. Some UTF-16 code points are 32-bits, using a surrogate pair.

So it's the same trade-off as with UTF-8. Thus no reason not to just simply use UTF-8 in the first place and take advantage of the memory savings.

link

jstimpfle 3782 days ago

Again, my original parent's statement was not about encoding or memory savings. The statement was that it was a bad idea to index into an (abstract) unicode string (of unicode code points -- not compositions thereof whatsoever).

I didn't question that, but hoped to get some inspiration for sane usage of unicode handling (which I'm not sure is humanly possible except for treating it as a rather black box and make no promises).

link

imron 3782 days ago

Your original parent was all about encodings, and mentioned it was a bad idea to arbitrarily index in to utf8 strings, (no mention of abstract strings of unicode codepoints).

> languages such as Rust gain efficiency by working with unmodified UTF-8. All you lose is constant-time arbitrary indexing

So it's saying Rust mostly benefits from using utf8, but in doing so, it loses the ability to arbitrarily index a character in a string (in constant time).

If it was abstract strings of unicode codepoints then there is no problem - except you'd then be using 32bits per codepoint.

link

imron 3782 days ago

Actually, they are not combining code points. Take for example the character 𪚥 (4 dragons).

The codepoint is U+2A6A5, but in UTF16 it requires combining 2 utf16 characters (\uD869 and \uDEA5) in order to reference it.

The codepoint however is still exactly the same (U+2A6A5).

link

vardump 3781 days ago

> The codepoint is U+2A6A5, but in UTF16 it requires combining 2 utf16 characters (\uD869 and \uDEA5) in order to reference it.

No, you mean two UTF-16 code units. A character is one or more code points.

link