Hacker News new | ask | show | jobs
by camgunz 3213 days ago
UCS-4 is essentially never the right choice. It wastes space and thus messes up your cache. UCS-2 can be the right choice if the language you're encoding uses a lot of non-Latin glyphs (i.e. East Asian languages) but suffers from the same problem as UCS-2. UTF-8 is a good default: for most strings it's very compact, and for strings with a lot of multibyte codepoints it doesn't compare too unfavorably with UTF-16.

Python 3 tried to have its cake and eat it too by choosing the most compact encoding depending on the string, but in practice this wastes a lot of space. You'll double (or heaven forbid quadruple) your string size because of a single codepoint, and these codepoints are almost always a small percentage of the string. That's actually why UTF-16 and UTF-8 exist.

It would have been better for strings to default to UTF-8, and to add an optional encoding so the programmer can specify what kind of encoding to use. As it is now, in order to use (for example) UTF-16 strings in Python you have to keep them around as bytes, decode them to a string, perform string operations, and reencode them to bytes again. Any benefit you get from using UTF-16 vanishes the moment you need to operate on it like a string, in other words.

I get that the idea was to maintain indexing via codepoint, but (again) in practice that's not great: usually you want to index via grapheme -- if you want to index at all.

A better solution is to allow programmers to specify string encoding and default it to UTF-8. From that, there's a clear path to everything you'd want to do.

6 comments

I've done the math for a few of the programs I've worked on, and the waste was negligible every time. A lot of strings like "reply" that end up being ten bytes longer in UCS-4 than UTF-8 once you add all the object and allocator overhead, progressively fewer long strings. Even the string-heavy code I worked on didn't spend much more than ten per cent of its total memory on strings, having the typical object be 40 instead of 30 bytes wasn't a big deal then.

Perhaps I should give an example. Suppose you're parsing and dealing with something. Say HTML since it's well-known. So you receive a long byte array starting "<html><body><p>Sometimes</p>". You parse the byte array and produce a number of objects, including up to four strings, namely "html", "body", "p" and "Sometimes", and by the time you've stored those in objects and allocated them, they occupy 32 bytes each on the heap. If you use UCS-4 the last may need 48 or 64, depending on your allocator's rounding and buckets. The byte array you for from the I/O subsystem may be 100k but most of the strings in the code are short, and the impact of using UCS-4 is moderate.

A more interesting question is whether UCS-4's advantages are worth it. It provides an array of characters, but as the years pass, the code I see does ever less char-array processing on strings. 20-30 years ago the world was full of char pointers, now, not so much. Something like this looks more typical, and doesn't benefit much from UCS-4, if at all: foo.split(" ").each{|word| bar(word) }.

> A more interesting question is whether UCS-4's advantages are worth it. It provides an array of characters, but as the years pass, the code I see does ever less char-array processing on strings. 20-30 years ago the world was full of char pointers, now, not so much. Something like this looks more typical, and doesn't benefit much from UCS-4, if at all: foo.split(" ").each{|word| bar(word) }.

You are looking at the issue from the perspective from a language user, not a language designer. 20 years ago we didn't have languages such as Python/Ruby which had internal multibyte support in their sting manipulation functions. 20 years ago string manipulation functions didn't even exist!

But this post is about the design of the language, not the application, and the language is still written in C/C++ and _internally_ stores strings as byte arrays that must be presented nicely to the programmer in that language's string manipulation functions.

> You are looking at the issue from the perspective from a language user, not a language designer. 20 years ago we didn't have languages such as Python/Ruby which had internal multibyte support in their sting manipulation functions.

20 years ago was 1997. I'm reasonably certain NSString has been unicode-aware for much longer than that.

> 20 years ago string manipulation functions didn't even exist!

What kind of absolute utter nonsense is that?

> But this post is about the design of the language, not the application, and the language is still written in C/C++ and _internally_ stores strings as byte arrays that must be presented nicely to the programmer in that language's string manipulation functions.

So?

That should have been _30_ years ago string manipulation functions didn't exist.

NSString may have been Unicode-aware (I've never used Objective-C), and I believe that even the early Javas supported multibyte strings, but at that time most business and consumer desktop applications in the Windows world were still written in C/C++. Do you remember when the Euro symbol became common? I'm pretty sure that character alone was responsible for much of the push to support Unicode.

> I get that the idea was to maintain indexing via codepoint, but (again) in practice that's not great: usually you want to index via grapheme -- if you want to index at all.

I definitely need indexes, and I don't really care about graphemes. I actually have only a vague idea what that is.

I write parsers typically by using a global string and lots of indices. The important thing for me is to be able to extract characters and slices at given positions, and to be able to say "parse error at line X character Y" where X and Y are helpful to the user most of the time.

I would be absolutely fine with working in UTF-8 bytes only (and that would be faster I guess), but there would be a more pressing need to recompute character positions (as a code point or grapheme index) from byte offsets at times.

There are more abstract parsing methods where parser subroutines are implemented in a position agnostic way, but I'm very happy with my simple method.

If everything works on graphemes instead of code points (as I think does Perl6) I will be happy to use that, but it's not so important from a practical standpoint.

The issue with your model is that’s graphemes ultimately don’t behave like you may expect a character to. For example, it may take multiple code points to make a grapheme, so getting the index of any particular one might require walking the string instead of a constant time access since any one code point could be “globbed” with its neighbors into a grapheme–in a way that is dependent on its neighbors.
> I definitely need indexes

No you don't. You need iterators, which behave like pointers. Let's say you're hundreds or thousands of characters into a string at the start of some token. Now you want to scan from that position to the end of the token.

With indexes it works fast only if it's by codepoint. in a language that properly supports graphemes this would mean it would have to scan from the beginning to get to that index.

With iterators it can start scanning from that position directly. Same speed no matter where you are in the string. With indexes the larger your input the slower your parse gets, and not in a linear way.

It's also super easy to get a slice using a start and end iterator. As for line x character y messages, you can't get that directly from an index as it depends on how many new lines you parsed so indexing doesn't help there.

Well, I could roll my own iterator which encapsulates a string and some position information, but then I'd have to wrap a lot of different operations, like advance, advance by n, compare two iterators by position, test for end position, extract character, extract slice, etc.

And the code would get a lot noiser, while the only advantage I see is graphemes support, which I have never needed so far. (And I hope graphemes are actually designed with a similar sensibility for technical concerns as is UTF-8, where I can simply parse with indexes at the byte level, looking only for ASCII characters, without headaches and with maximum performance.)

As for getting line/character from a byte or codepoint offset, that's no problem if I do the calculation only in case of an error. The alternative would be to do it on each advance, which again means ADT wrapping, thus line noise and slower performance.

I'm not advocating that the programmer needs to implement the iterators but that the language/runtime have built in support for them.

As for searching for ASCII, which is prevalent in parsing, the iterator function to find the next specified character can do a low level and fast byte search. That's one of the benefits of UTF-8, searching for ASCII characters is super fast.

You wouldn't have to do the character position on each advance. Just have a beginning of line iterator that's updated every time you see a newline character and on error you do call a function that gives you how many characters between the current position iterator and the start of line iterator.

Working with iterators is no more coplex than working with indexes. But it's the language that needs to provide them.

Formal parsers do without indexing, but those rolled by hand often do, for simplicity's sake. I think these cases can still be serviced by permitting indexes, but backing them by a lazily computed index table.
> I get that the idea was to maintain indexing via codepoint, but (again) in practice that's not great

Of course not, but it was considered that breaking O(1) indexing guarantees were a bridge too far even in the breaky release of Python 3.

UTF-16 is a right choice if you are in asia or outside of the west. UTF-8 is the right choice if you are in america or using the internet.

> A better solution is to allow programmers to specify string encoding and default it to UTF-8.

Agreed. UTF-8 is the sensible default for most people.

> Any benefit you get from using UTF-16 vanishes the moment you need to operate on it like a string, in other words.

So, don't decode to a string, and do all your character manipulation on the bytes.

> A better solution is to allow programmers to specify string encoding and default it to UTF-8.

Absolutely not: the internal representation of a string should be of no interest to a user of your language. The 'best' solution is to represent strings as a list of index lookups into a palette, and to update the palette as new graphemes are seen. This is similar to the approach Perl6 is using[0].

[0]: https://6guts.wordpress.com/2015/12/05/getting-closer-to-chr...

> So, don't decode to a string, and do all your character manipulation on the bytes.

WHAT?!? I suppose that you've only ever worked with Latin characters. Please show a code example of changing European to African in this sentence in your language of choice, working on the bytes in any multi-byte encoding:

מהי מהירות האווירית של סנונית ארופאית ללא משא?‏

Yes, that is a Hebrew Monty Python quote. Now try it with a smiley somewhere in the string (HN filtered out my attempt to post the string with a smiley).

Is each application to maintain their own dictionary of code points? If the map is to be in a library, then why not have it in the language itself?

I don't understand your complaints. You clearly have some task you have in mind that you wish to perform: why not tell me what it is?

> Please show a code example of changing European to African in this sentence in your language of choice, working on the bytes in any multi-byte encoding:

מהי מהירות האווירית של סנונית ארופאית ללא משא?‏

I don't see the string 'European' in that sentence, it seems to be solely comprised of Hebrew characters.

edit to attempt to answer your question:

    struct m {
        pos_t start;
        pos_t end;
    }

    int findsn(char* str, char* substr, match m) {
        next: for( int c_i = 0; c_i++; s[c_i] != '\0' ) {
            match.start = c_i;
            int s_i = 0;
            for( ; s_i++; substr[s_i] != '\0' ) {
                if( str[c_i] != substr[s_i] ) goto next;
            }
            match.end = c_i + s_i;
            return 1;
        }
        return 0;
    }

    char* replacesn(char* str, char* needle, char* rpl) {
        match m;
        if( findsn(str, needle, &m) ) {
            splicesn(str, m.start, m.end, rpl);
        }
        return str;
    }
splicesn should be obvious, and you normalise your strings before calling replacesn. This is just me crappily re-implementing a fraction of the wchar API without checking MSDN.

edit 2:

> Is each application to maintain their own dictionary of code points?

No, you use the system/standard library for composing/decomposing/normalising codepoints.

> If the map is to be in a library, then why not have it in the language itself?

Why not indeed? What a great idea.

You win on the string replace, that was a bad example. Try a regex replace! But I will also mention that seeing properly indented code with clear identifier names is refreshing where I work!

> Why not indeed? What a great idea.

It sounded to me that you were arguing that string manipulation functions do not need to be included in modern programming languages. You said: "don't decode to a string, and do all your character manipulation on the bytes"

OK, I see how what I said could mean that. What I meant was: if using the language's internal string representation gives poor performance/resource usage, better to avoid it and directly manipulate the undecoded bytes. Most languages allow you to control when loaded data is converted to strings; simply don't convert it, and uh reimplement stdlib functions to work with your preferred encoding.
Except in East Asia with a population of over one billion.
Yeah I think UTF-8 is pretty euro-centric and that doesn't get enough play. Being able to set the default string encoding in your program would do a lot to alleviate that, I wish there were a language that provided it.