HN Mirror


by kevincox 1592 days ago

> Native UTF-8 in memory makes character indexing a non-constant time operation

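The only reason that Java's UTF-16 has constant-time indexing is that it uses a braindead definition of character, namely "UTF-16 code unit". Anything outside the BMP takes two code units (a surrogate pair), so `charAt` can hand you half a character.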
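To make the code-unit point concrete, here is a minimal sketch using only standard `java.lang.String` and `Character` methods. The class name `CodeUnitDemo` and its helper names are made up for illustration; U+1F600 is just one convenient example of a code point outside the BMP.

```java
final class CodeUnitDemo {
    // U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
    static final String GRIN = new String(Character.toChars(0x1F600));

    // Number of UTF-16 code units: what length() and charAt() are defined over.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points: requires a scan of the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a" + GRIN + "b";

        System.out.println(codeUnits(s));   // 4
        System.out.println(codePoints(s));  // 3

        // Index 1 is constant-time to reach, but it is only half of the emoji.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true

        // Finding the third code point means walking the string from the start.
        int idx = s.offsetByCodePoints(0, 2);
        System.out.println(idx);                      // 3
        System.out.println((char) s.codePointAt(idx)); // b
    }
}
```

So the constant-time operation (`charAt`) is indexing by code unit, while indexing by code point (`offsetByCodePoints`) has to walk the string, much as it would in UTF-8.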

If you want constant-time character indexing you need to go to UTF-32, where every code point takes four bytes. But obviously the memory cost is too great for most users, so in practice everyone uses UTF-8 because it is usually the most memory-efficient.
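A rough way to see the size trade-off is to count bytes per encoding for a few strings. This is only a sketch: `EncodingSizeDemo` and its method names are invented for the example, and the UTF-32 size is computed as four bytes per code point rather than by calling an actual UTF-32 encoder.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizeDemo {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16LE is used so that no byte-order mark is counted.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    // Assumption: UTF-32 without a BOM, i.e. exactly 4 bytes per code point.
    static int utf32Bytes(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "hello, world";          // 12 ASCII characters
        String cjk = "\u65e5\u672c\u8a9e";      // three BMP CJK characters

        // ASCII: 12 / 24 / 48 bytes
        System.out.println(utf8Bytes(ascii) + " " + utf16Bytes(ascii) + " " + utf32Bytes(ascii));
        // CJK: 9 / 6 / 12 bytes
        System.out.println(utf8Bytes(cjk) + " " + utf16Bytes(cjk) + " " + utf32Bytes(cjk));
    }
}
```

The CJK line is why "usually" is the right word: for text that is mostly three-byte UTF-8 sequences, UTF-16 comes out smaller, though UTF-32 is the largest in both cases.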

Plus it turns out that character indexing isn't actually that common an operation, so UTF-8 really is the right move for almost every application.

2 comments

remexre 1592 days ago

UTF-32 isn't really a solution either, unless you consider a scalar value to be a character; I bet almost nobody wants U+0308 to be "a character"...
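For illustration, a small sketch of the gap between scalar values and what a user would call a character. It leans on the JDK's `java.text.BreakIterator` character instance as an approximation of grapheme segmentation; the class name `CombiningMarkDemo` and its helpers are hypothetical, and the exact boundary rules depend on the JDK version, so only the simple combining-mark case is shown.

```java
import java.text.BreakIterator;

final class CombiningMarkDemo {
    // Number of Unicode scalar values (what a UTF-32 index would count).
    static int scalarValues(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of user-perceived characters, as segmented by BreakIterator.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'u' followed by U+0308 COMBINING DIAERESIS renders as a single "ü".
        String s = "u\u0308";
        System.out.println(scalarValues(s)); // 2
        System.out.println(graphemes(s));    // 1
    }
}
```

Even with four bytes per scalar value, index 1 of this string is the bare combining mark, so fixed-width code points still don't give constant-time indexing by perceived character.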


native_samples 1590 days ago

But in practice the Java definition of a character basically always works, because characters that aren't in the BMP are vanishingly rare in real software outside of emoji, and of course, Java long pre-dates emoji.
