HN Mirror


by Dylan16807 3826 days ago

You get people to accept the truth that characters have a variable length in bytes.

Then you offer a data structure that lets you perform O(1) or O(log n) operations on sequences of single-character strings, where each element holds one whole grapheme cluster.

If it's read-only, you could make it just be an index. The details don't matter a lot; the point is you can make something that's both correct with respect to grapheme clusters and probably more space-efficient than UTF-32, despite the extra data.
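To make the read-only index idea concrete, here is a rough Python sketch, not a real implementation. The class name `ClusterIndex` and its layout are made up for illustration. It also cheats on segmentation: a cluster here is just a starter followed by its combining marks (code points with a nonzero canonical combining class), which is much cruder than the full grapheme cluster rules in UAX #29.

```python
import unicodedata
from array import array


class ClusterIndex:
    """Read-only O(1) access to (approximate) clusters in a UTF-8 buffer."""

    def __init__(self, text: str) -> None:
        self.data = bytearray()
        self.offsets = array("L")  # byte offset where each cluster starts
        for ch in text:
            # Simplification: a new cluster begins at every code point with
            # combining class 0, or at the very start of the text.
            if unicodedata.combining(ch) == 0 or not self.offsets:
                self.offsets.append(len(self.data))
            self.data += ch.encode("utf-8")
        self.offsets.append(len(self.data))  # sentinel: end of last cluster

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, i: int) -> str:
        if not 0 <= i < len(self):
            raise IndexError(i)
        start, end = self.offsets[i], self.offsets[i + 1]
        return bytes(self.data[start:end]).decode("utf-8")


idx = ClusterIndex("e\u0301an\u0303")  # "é" and "ñ" written with combining marks
print(len(idx), [idx[i] for i in range(len(idx))])
```

Indexing by cluster is two array lookups and a slice. Whether the offset table plus the UTF-8 bytes beats UTF-32 on space depends on the text and on how compactly you store the offsets.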

And then the encoding inside the character strings isn't particularly important, but might as well use UTF-8.

Either that or make yourself a hilariously inefficient format based on:

UAX15-D3. Stream-Safe Text Format: A Unicode string is said to be in Stream-Safe Text Format if it would not contain any sequences of non-starters longer than 30 characters in length when normalized to NFKD.
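For reference, the definition can be checked fairly directly with Python's standard library. This is only a sketch of the check as worded above; the function names are my own, and I'm treating "non-starter" as any code point with a nonzero canonical combining class.

```python
import unicodedata

MAX_NON_STARTERS = 30  # limit quoted in UAX15-D3


def longest_non_starter_run(text: str) -> int:
    """Length of the longest run of non-starters after NFKD normalization."""
    longest = run = 0
    for ch in unicodedata.normalize("NFKD", text):
        # A non-starter extends the current run; a starter resets it.
        run = run + 1 if unicodedata.combining(ch) != 0 else 0
        longest = max(longest, run)
    return longest


def is_stream_safe(text: str) -> bool:
    return longest_non_starter_run(text) <= MAX_NON_STARTERS


print(is_stream_safe("a" + "\u0301" * 30))  # exactly at the limit
print(is_stream_safe("a" + "\u0301" * 31))  # one combining mark too many
```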

Who's with me on 128-byte characters?