| HN Mirror


by vardump 3779 days ago

> My parent was speaking about indexing at the code points level, not at the encoding (byte / character) level.

You need UTF-32 for constant-time (random) indexing of code points. UTF-16 has 16-bit code units, and code points outside the Basic Multilingual Plane take two of them (a surrogate pair), so 32 bits.

So it's the same trade-off as with UTF-8. There's no reason not to simply use UTF-8 in the first place and take advantage of the memory savings.
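To make the code-unit point concrete, here is a minimal Rust sketch that counts how many code units the same string needs in each encoding form. The function names `code_unit_counts` and `needs_surrogate_pair` are made up for illustration; only standard-library calls are used.

```rust
/// Number of code units needed to store `s` in each encoding form.
/// Returns (UTF-8 bytes, UTF-16 code units, UTF-32 code units).
fn code_unit_counts(s: &str) -> (usize, usize, usize) {
    // `str::len` is the length in bytes, i.e. UTF-8 code units.
    let utf8 = s.len();
    // One u16 per BMP code point, two (a surrogate pair) otherwise.
    let utf16 = s.encode_utf16().count();
    // One 32-bit unit per code point.
    let utf32 = s.chars().count();
    (utf8, utf16, utf32)
}

/// True if `c` lies outside the BMP and so needs a surrogate pair in UTF-16.
fn needs_surrogate_pair(c: char) -> bool {
    c.len_utf16() == 2
}
```

For the three code points U+0061, U+20AC and U+1F600, `code_unit_counts("a\u{20AC}\u{1F600}")` gives 8 bytes in UTF-8, 4 code units in UTF-16 and 3 in UTF-32, so the number of UTF-16 code units is not the number of code points either.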


jstimpfle 3779 days ago

Again, my original parent's statement was not about encoding or memory savings. The statement was that it was a bad idea to index into an (abstract) unicode string (of unicode code points -- not compositions thereof whatsoever).

I didn't question that, but hoped to get some inspiration for sane Unicode handling (which I'm not sure is humanly possible, except by treating it as a black box and making no promises).

imron 3779 days ago

Your original parent was all about encodings, and mentioned it was a bad idea to arbitrarily index into UTF-8 strings (no mention of abstract strings of Unicode code points).

> languages such as Rust gain efficiency by working with unmodified UTF-8. All you lose is constant-time arbitrary indexing

So it's saying Rust mostly benefits from using UTF-8, but in doing so it loses the ability to index an arbitrary character in a string in constant time.

If it were abstract strings of Unicode code points, there would be no problem, except you'd then be using 32 bits per code point.
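A rough Rust sketch of that trade-off, with hypothetical helper names (`nth_code_point`, `to_code_points`) and only standard-library calls: indexing by code point on a UTF-8 `&str` has to walk from the start, while decoding once into a `Vec<char>` gives constant-time indexing at 4 bytes per code point.

```rust
/// Code point at index `i` of a UTF-8 string.
/// Walks the string from the start, so the cost grows with `i`.
fn nth_code_point(s: &str, i: usize) -> Option<char> {
    s.chars().nth(i)
}

/// Decode once into a buffer with one 32-bit `char` per code point.
/// Indexing the result is constant-time, at the cost of extra memory.
fn to_code_points(s: &str) -> Vec<char> {
    s.chars().collect()
}
```

For `"h\u{E9}llo \u{1F600}!"` (8 code points), the UTF-8 string takes 12 bytes, while the `Vec<char>` holds 8 elements of `std::mem::size_of::<char>()` = 4 bytes each, 32 bytes in total. Note that this indexes code points only, not user-perceived characters built from several code points.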