by camgunz 3213 days ago

UCS-4 is essentially never the right choice. It wastes space and thus messes up your cache. UCS-2 can be the right choice if the language you're encoding uses a lot of non-Latin glyphs (e.g. East Asian languages), but it suffers from the same problem as UCS-4. UTF-8 is a good default: for most strings it's very compact, and for strings with a lot of multibyte codepoints it doesn't compare too unfavorably with UTF-16.
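To put rough numbers on that trade-off, here is a minimal Python sketch that counts encoded bytes for two sample strings. The sample strings are illustrative assumptions, UTF-32 stands in for UCS-4, and the "-le" codec variants are used only so that a byte-order mark doesn't inflate the counts.

```python
# Encoded size of the same text in three encodings.
# UTF-32 stands in for UCS-4; for BMP-only text UTF-16 and UCS-2 are byte-identical.
def encoded_sizes(text):
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

sizes_latin = encoded_sizes("hello")               # ASCII-only sample
sizes_cjk = encoded_sizes("\u65e5\u672c\u8a9e")    # three CJK codepoints

print(sizes_latin)  # {'utf-8': 5, 'utf-16': 10, 'utf-32': 20}
print(sizes_cjk)    # {'utf-8': 9, 'utf-16': 6, 'utf-32': 12}
```

The CJK sample is where UTF-16 wins, and even there UTF-8 is only 1.5x larger, while the fixed 4-byte form is the largest in both cases.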

Python 3 tried to have its cake and eat it too by choosing, per string, the narrowest fixed-width encoding (1, 2, or 4 bytes per codepoint) that fits every codepoint in it, but in practice this wastes a lot of space. You'll double (or heaven forbid quadruple) your string size because of a single codepoint, and these codepoints are almost always a small percentage of the string. Avoiding that is actually why variable-width encodings like UTF-16 and UTF-8 exist.
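The effect is easy to observe on CPython. This is a rough sketch, assuming CPython 3.3 or later; `sys.getsizeof` includes object overhead and the exact figures vary by version, so only the differences between the three sizes matter here.

```python
import sys

# CPython-specific: sys.getsizeof reports the in-memory size of the str object.
base = "a" * 1000                        # ASCII only: 1 byte per codepoint
one_bmp = "a" * 999 + "\u20ac"           # a single euro sign forces 2 bytes per codepoint
one_astral = "a" * 999 + "\U0001F600"    # a single emoji forces 4 bytes per codepoint

size_base = sys.getsizeof(base)
size_bmp = sys.getsizeof(one_bmp)
size_astral = sys.getsizeof(one_astral)

# All three strings have 1000 codepoints; only one codepoint differs.
print(size_base, size_bmp, size_astral)
```

On a typical build the three sizes come out at roughly 1 KB, 2 KB, and 4 KB.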

It would have been better for strings to default to UTF-8, and to add an optional encoding so the programmer can specify what kind of encoding to use. As it is now, in order to use (for example) UTF-16 strings in Python you have to keep them around as bytes, decode them to a string, perform string operations, and reencode them to bytes again. Any benefit you get from using UTF-16 vanishes the moment you need to operate on it like a string, in other words.
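The round trip looks something like the following. This is a minimal sketch; the sample text and the choice of `upper()` as the string operation are arbitrary assumptions.

```python
# UTF-16 data has to live in a bytes object; str has no selectable encoding.
stored = "h\u00e9llo \u6771\u4eac".encode("utf-16-le")   # 8 codepoints, 16 bytes

text = stored.decode("utf-16-le")          # 1. decode the bytes to a str
text = text.upper()                        # 2. perform the string operation
stored_upper = text.encode("utf-16-le")    # 3. re-encode back to bytes

print(len(stored), len(stored_upper))
```

Every operation pays for a full decode and a full encode, and the intermediate `str` uses whatever internal representation the interpreter picks rather than UTF-16.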

I get that the idea was to maintain indexing via codepoint, but (again) in practice that's not great: usually you want to index via grapheme -- if you want to index at all.
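A small sketch of why codepoint indexing falls short, using only the standard library. The example word is an assumption, and NFC normalization is shown as a partial workaround only: it helps when a precomposed form exists, and it is not grapheme segmentation.

```python
import unicodedata

# "café" written with a combining accent: 'e' followed by U+0301.
word = "cafe\u0301"

codepoints = len(word)   # 5 codepoints, though a reader sees 4 characters
last_base = word[3]      # 'e' alone: the index splits the letter from its accent

# NFC merges the pair into the single precomposed codepoint U+00E9.
composed = unicodedata.normalize("NFC", word)

print(codepoints, repr(last_base), len(composed))  # 5 'e' 4
```

Proper grapheme indexing needs the Unicode text segmentation rules, which Python's `str` does not implement.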

A better solution is to allow programmers to specify string encoding and default it to UTF-8. From that, there's a clear path to everything you'd want to do.

6 comments

Arnt 3212 days ago

I've done the math for a few of the programs I've worked on, and the waste was negligible every time. There are a lot of strings like "reply" that end up being ten bytes longer in UCS-4 than in UTF-8 once you add all the object and allocator overhead, and progressively fewer long strings. Even the string-heavy code I worked on didn't spend much more than ten per cent of its total memory on strings, so having the typical object be 40 instead of 30 bytes wasn't a big deal.

Perhaps I should give an example. Suppose you're parsing and dealing with something. Say HTML since it's well-known. So you receive a long byte array starting "<html><body><p>Sometimes</p>". You parse the byte array and produce a number of objects, including up to four strings, namely "html", "body", "p" and "Sometimes", and by the time you've stored those in objects and allocated them, they occupy 32 bytes each on the heap. If you use UCS-4 the last may need 48 or 64, depending on your allocator's rounding and buckets. The byte array you get from the I/O subsystem may be 100k but most of the strings in the code are short, and the impact of using UCS-4 is moderate.
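That arithmetic can be sketched as follows. The `heap_size` helper, the 8- and 16-byte headers, and the 16-byte buckets are all hypothetical values chosen to reproduce the figures above; they are not measurements of any real allocator or string class.

```python
def heap_size(payload_bytes, header=16, bucket=16):
    """Round header + payload up to the next bucket (all sizes are assumptions)."""
    total = header + payload_bytes
    return -(-total // bucket) * bucket   # ceiling to a multiple of bucket

names = ["html", "body", "p", "Sometimes"]

# 1 byte per character (ASCII in UTF-8) versus 4 bytes per character (UCS-4).
one_byte = {s: heap_size(len(s)) for s in names}
four_byte = {s: heap_size(4 * len(s)) for s in names}

print(one_byte)                           # every string fits a 32-byte bucket
print(four_byte)                          # only "Sometimes" grows, to 64
print(heap_size(4 * 9, header=8))         # 48 with a smaller assumed header
```

Under these assumptions three of the four strings cost the same in either encoding, because rounding already pads them to 32 bytes.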

A more interesting question is whether UCS-4's advantages are worth it. It provides an array of characters, but as the years pass, the code I see does ever less char-array processing on strings. 20-30 years ago the world was full of char pointers; now, not so much. Something like this looks more typical, and doesn't benefit much from UCS-4, if at all: foo.split(" ").each{|word| bar(word) }.