by Arnt
3206 days ago
I've done the math for a few of the programs I've worked on, and the waste was negligible every time. There are a lot of strings like "reply" that end up about ten bytes longer in UCS-4 than in UTF-8 once you add all the object and allocator overhead, and progressively fewer long strings. Even the string-heavy code I worked on didn't spend much more than ten per cent of its total memory on strings, so having the typical object be 40 instead of 30 bytes wasn't a big deal.

Perhaps I should give an example. Suppose you're parsing and dealing with something, say HTML, since it's well known. You receive a long byte array starting "<html><body><p>Sometimes</p>". You parse the byte array and produce a number of objects, including up to four strings, namely "html", "body", "p" and "Sometimes". By the time you've stored those in objects and allocated them, they occupy 32 bytes each on the heap. If you use UCS-4, the last may need 48 or 64, depending on your allocator's rounding and buckets. The byte array you get from the I/O subsystem may be 100k, but most of the strings in the code are short, and the impact of using UCS-4 is moderate.

A more interesting question is whether UCS-4's advantages are worth it. It provides an array of characters, but as the years pass, the code I see does ever less char-array processing on strings. 20-30 years ago the world was full of char pointers; now, not so much. Something like this looks more typical, and doesn't benefit much from UCS-4, if at all: foo.split(" ").each{|word| bar(word) }.
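To make the arithmetic in the HTML example concrete, here is a rough sketch in Ruby. The allocator model is an assumption for illustration only: a 16-byte per-object header and rounding up to 16-byte buckets. Real allocators and string objects differ, so treat the numbers as indicative rather than measured.

```ruby
# Hypothetical allocator model (both values are assumptions, not measurements):
# every string object carries a 16-byte header, and the allocator rounds
# each request up to the next multiple of 16 bytes.
HEADER_BYTES = 16
BUCKET_BYTES = 16

# Heap bytes consumed by one string object with the given payload size.
def heap_size(payload_bytes)
  total = HEADER_BYTES + payload_bytes
  ((total + BUCKET_BYTES - 1) / BUCKET_BYTES) * BUCKET_BYTES
end

# Compare the heap footprint of the same text stored as UTF-8 and as UCS-4.
def footprint(str)
  utf8_payload = str.encode("UTF-8").bytesize
  ucs4_payload = str.encode("UTF-32LE").bytesize # 4 bytes per code point, no BOM
  { utf8: heap_size(utf8_payload), ucs4: heap_size(ucs4_payload) }
end

# The four strings produced by parsing "<html><body><p>Sometimes</p>".
SIZES = %w[html body p Sometimes].map { |s| [s, footprint(s)] }.to_h

SIZES.each do |s, f|
  puts format("%-10s utf8=%d ucs4=%d", s, f[:utf8], f[:ucs4])
end
```

Under these assumptions "html", "body" and "p" land in the 32-byte bucket either way, and only "Sometimes" grows, from 32 to 64 bytes. A different header size or bucket scheme would shift the last figure, which is the "48 or 64" spread mentioned above.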
You are looking at the issue from the perspective of a language user, not a language designer. 20 years ago we didn't have languages such as Python/Ruby which had internal multibyte support in their string manipulation functions. 20 years ago string manipulation functions didn't even exist!
But this post is about the design of the language, not the application, and the language is still written in C/C++ and _internally_ stores strings as byte arrays that must be presented nicely to the programmer in that language's string manipulation functions.
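
As a small illustration of that split between internal storage and what the programmer sees, here is a sketch using Ruby, whose strings are byte sequences tagged with an encoding. The sample text and constant names are made up for the example; the methods are standard `String` methods.

```ruby
# Sample text with two non-ASCII characters, written with escapes so the
# source file's own encoding cannot change what is stored.
TEXT = "na\u00EFve caf\u00E9" # "naïve café"

# What the runtime stores: a byte array in UTF-8.
BYTE_COUNT  = TEXT.bytesize        # ï and é take two bytes each
FIRST_BYTES = TEXT.bytes.first(4)  # 'n', 'a', then the two bytes of ï

# What the string functions present: characters and words.
CHAR_COUNT   = TEXT.length
WORD_LENGTHS = TEXT.split(" ").map(&:length)

puts "bytes=#{BYTE_COUNT} chars=#{CHAR_COUNT} words=#{WORD_LENGTHS.inspect}"
```

The programmer calling `split` or `length` never touches the byte offsets; the language implementation does that work, which is where the choice of internal encoding actually matters.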