Hacker News new | ask | show | jobs
by ubernostrum 3211 days ago
As I said in the article, I think the overhead of adding yet more weirdness in the form of quirks of the internal encoding (which could vary according to how the Python interpreter was compiled!) is a bad thing to do on top of how much people seem to struggle mentally just to get Unicode all on its own.

Though I also think the struggle is mostly due to people being stuck in an everything-is-like-ASCII mindset, and though I didn't get into that, it's one big reason why I think UTF-8 is generally the wrong way to expose Unicode to a programmer, since it lets them think they can keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend.

Personally I'd like everyone to just actually learn at least the things about Unicode that I went into here (such as why "one code point == one character" is a wrong assumption), and I think that'd alleviate a lot of the pain. I also avoided talking much about normalization, because too many people hear about it and decide they can just normalize to NFKC and go back to assuming code point/character equivalence post-normalization.

2 comments

> it's one big reason why I think UTF-8 is generally the wrong way to expose Unicode to a programmer, since it lets them think they can keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend.

Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.

I actually like UTF-8 because it will break very quickly, and force the programmer to do the right thing. The first time you hit é or € or ️an emoji, you'll have a multibyte character, and you'll need to deal with it.

All the other options will also break, but later on:

- If you use UTF-16, then é and € will work, but emoji will still result in surrogate pairs.

- If you use a 4-byte representation, then you'll be able to treat most emoji as single characters. But then somebody will build é from two separate code points as "e + U+0301 COMBINING ACUTE ACCENT", or you'll run into a flag or skin color emoji, and once again, you're back at square zero.

You can't really index Unicode characters like ASCII strings. Written language is just too weird for that. But if you use UTF-8 (with a good API), then you'll be forced to accept that "str[3]" is hopeless very quickly. It helps a lot if your language has separate types for "byte" and "Unicode codepoint", however, so you can't accidentally treat a single byte as a character.

This sort of thing is why Swift treats grapheme clusters, rather than code points or bytes or "characters", as the fundamental unit of text. When I first started learning Swift I thought that was a weird choice that would just get in the way, but these days I'm coming around to their way of thinking.
Treating grapheme clusters as fundamental is slightly problematic in the sense that then the fundamentals change as Unicode adds more combining characters. A reasonable programming environment should still provide iteration by grapheme cluster as a library feature whose exact behavior is expected to change over time as the library tracks new Unicode versions.

Depending on the task at hand, iterating by UTF-8 byte or by code point can make sense, too. And the definition of these is frozen regardless of Unicode version, which makes these safer candidates for "fundamental" operations. There is no right unit of iteration for all tasks.

Pfft, that is just as bad. There is no 'fundamental unit of text'. There are different units of text that are appropriate to different tasks.

If I want to know how much memory to allocate, bytes are it. If I want to know how much screen space to allocate, font rendering metrics are it. If I want to do word-breaking, grapheme clusters are it.

None of these are fundamental.

Since a string doesn't have any font rendering metrics (in that it lacks a font or size), I'm not sure how you expect a language's String implementation to take it into account. Similarly, bytes will change based on encodings, which most people would expect a language's String type to abstract over. Do you really want UTF8 and UTF16 strings to behave differently and introduce even more complexity to a very complex system?

There are languages whose orthographies don't fit the Unicode grapheme cluster specification, but they're complex enough that I doubt there's any way to deal with them properly other than having someone proficient in them looking over your text processing or pawning it off to a library. At least with grapheme clusters your code won't choke on something as simple as unnormalized Latin text.

It's not about encodings at all, actually. It's about the API that is presented to the programmer.

And the way you take it all into account is by refusing to accept any defaults. So, for example, a string type should not have a "length" operation at all. It should have "length in code points", "length in graphemes" etc operations. And maybe, if you do decide to expose UTF-8 (which I think is a bad idea) - "length in bytes". But every time someone talks about the length, they should be forced to specify what they want (and hence think about why they actually need it).

Similarly, strings shouldn't support simple indexing - at all. They should support operations like "nth codepoint", "nth grapheme" etc. Again, forcing the programmer to decide every time, and to think about the implications of those decisions.

It wouldn't solve all problems, of course. But I bet it would reduce them significantly, because wrong assumptions about strings are the most common source of problems.

What do you mean by "expose UTF-8"? Because nothing about UTF-8 requires that you give byte access to the string.

As for indexing, strings shouldn't require indexing period. That's the ASCII way of thinking, especially fixed width columns and such. You should be thinking relatively. For example, find me the first space then using that point in the string the next character needs to be letter. When you build you're code that way you don't fall for the trap of byte indexing or the performance hit of codepoint indexing (UTF-8) or grapheme indexing (all encodings).

This is exactly how Swift’s Strings work.
>If I want to know how much memory to allocate, bytes are it. If I want to know how much screen space to allocate, font rendering metrics are it. If I want to do word-breaking, grapheme clusters are it.

Size in memory/bytes you could get trivially for any string (and this doesn't change with whether you chose bytes, graphemes or code points or whatever to iterate).

Screen space is irrelevant/orthogonal to encoding -- it appears at the font level and the font rendering engine that will give the metrics will accept whatever encoding it is.

>Screen space is irrelevant/orthogonal to encoding

Exactly. That's why measurements of string length shouldn't ever assume I'm looking for a unit-of-offset for a monospaced font.

The problem is that most naive programmers think that's what a string length is and should be.

I would love it if Python at least would support the '\X' regex metacharacter in its own built-in regex module. Right now you have to turn to a third-party implementation to get that.

I also wish Python would expose more of the Unicode database than it does; I've had to turn to third-party modules that basically build their own separate database for some Unicode stuff Python doesn't provide (like access to the Script property).

Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.

Depends on what you want to index into it for. I'll admit that once upon a time I opposed adding a "truncate at N characters" template helper to Django since there was a real risk it would cut in the middle of a grapheme cluster, and I don't particularly care for the compromise that ended up getting it added (it normalizes the string-to-truncate to a composed form first to try to minimize the chance of slicing at a bad spot).

But when you get right down to it, what I do for a living is write web applications, and sometimes I have to write validation that cares about length, or about finding specific things in specific positions, and so indexing into a string is something I have do to from time to time, and I'd rather have it behave as a sequence of code points than have it behave as a sequence of bytes in a variable-width encoding.

As to whether UTF-8 forces people to deal with Unicode up-front, I very strongly disagree; UTF-8 literally has as a design goal that it puts off your need to think about anything that isn't ASCII.

Yes, while "back up one UTF-8 rune" is a well defined operation, "back up one grapheme" is tough. Forward is easy, though.

I had the need to write grapheme-level word wrap in Rust. Here it is. It assumes all graphemes have the same visible width. This is used mostly for debug output, not for general text rendering.

[1] https://github.com/John-Nagle/rust-rssclient/blob/master/src...

> But when you get right down to it, what I do for a living is write web applications,

That is my use case for Python as well.

> sometimes I have to write validation that cares about length,

That's where a trucation function that understands grapheme clusters whould come in so handy. Tell it that you want to truncate to n bytes maximum and let it chop a bit more as to not split a grapheme cluster.

Fortunately my database does not have fixed with strings so I rarely bump into this one.

> or about finding specific things in specific positions, and so indexing into a string is something I have do to from time to time

I write my code to avoid this. Yes I still have to use an index because that's what Python supports but it would be trivial to convert it to another language that supports string iterators.

I get the variable byte encodings. And I know that Unicode has things like U+0301 as you say, and so code points are not the same as characters/glyphs. But I don't understand why it was designed that way. Why is Unicode not simply an enumeration of characters.
It's important to distinguish between Unicode the spec, where people definitely do make that distinction, and implementations. Most of the problems are due to history: we have half a century of mostly English-speaking programmers assuming one character is one byte, especially bad when that's baked into APIs, and treating the problem as simpler than it really is.

Combining accents are a great example: if you're an American, especially in the 80s, it's easy to assume that you only need a couple of accents like you used in Spanish and French classes and that's really simple for converting old data to a new encoding. Later, it becomes obvious that far more are needed but by then there's a ton of code and data in the wild so you end up needing the concept of normalization for compatibility.

(That's the same lapse which lead to things like UCS-2 assuming 2^16 characters even though that's not enough for a full representation of Chinese alone.)

I think it's also worth remembering the combination of arrogance and laziness which was not uncommon in the field, especially in the 90s. I remember impassioned rants about how nobody needed anything more than ASCII from programmers who didn't want to have to deal with iconv, thought encoding was too much hassle, claimed it was too slow, etc. as if that excused not being able to handle valid requests. About a decade ago I worked at a major university where the account management system crashed on apostrophes or accents (in a heavily Italian town!) and it was just excused as the natural order of things so the team could work on more interesting problems.

One reason is because it would take a lot more code points to describe all the possible combinations.

Take the country flag emoji. They're actually two seperate code points. The 26 code points used are just special country code letters A to Z. The pair of letters is the country code and shows up as a flag. So just 26 codes to make all the flags in the world. Plus new ones can be added easily without having to add more code points.

Another example is the new skin tone emoji. The new codes are just the colour and are put in front of the existing emoji codes. Existing software just shows the normal coloured emoji but you may see a square box or question mark symbol in front of it.

>The pair of letters is the country code and shows up as a flag. So just 26 codes to make all the flags in the world. Plus new ones can be added easily without having to add more code points. Another example is the new skin tone emoji.

Still not answering the question though.

For one, when the unicode standard was originally designed it didn't have emoji in it.

Second, if it was limitations to the arbitrary addition of thousands of BS symbols like emoji that necessitate such a design, we could rather do without emojis in unicode at all (or klingon or whatever).

So, the question is rather: why not a design that doesn't need "normalization" and runes, code points, and all that...

Using less memory (like utf-8 allows) I guess is a valid concern.

It didn't have emoji but it did have other combining characters. While some langages it's feasable to normalize them to single code points but other langagues it would not be.

Plus the fact that some visible characters are made up of many graphemes the number of single code points would be huge.

As to your second point it seems to me to be a little close minded. The whole point of a universal character set was that languages can be added to it whether they be textual, symbolic or pictographic.

>As to your second point it seems to me to be a little close minded. The whole point of a universal character set was that languages can be added to it whether they be textual, symbolic or pictographic.

Representing all languages is ok as a goal -- adding klingon and BS emojis not so much (from a sanity perspective, if adding them meddled with having a logical and simple representation of characters).

So, it comes to "the fact that some visible characters are made up of many graphemes the number of single code points would be huge" and "while some languages it's feasable to normalize them to single code points but other langagues it would not be".

Wouldn't 32 bits be enough for all possible valid combinations? I see e.g. that: "The largest corpus of modern Chinese words is as listed in the Chinese Hanyucidian (汉语辞典), with 370,000 words derived from 23,000 characters".

And how many combinations are there of stuff like Hangul? I see that's 11,172. Accents in languages like Russian, Hungarian, Greek should be even easier.

Now, having each accented character as a separate might take some lookup tables -- but we already require tons of complicated lookup tables for string manipulation in UTF-8 implementations IIRC.

> So, the question is rather: why not a design that doesn't need "normalization" and runes, code points, and all that...

Because language is messy. At some point you have to start getting into the raw philosophy of language and it's not just a technical problem at that point but a political problem and an emotional problem.

Take accents as one example: in English a diaresis is a rare but sometimes useful accent mark to distinguish digraphs (coöperate should be pronounced as two Os, not one OOOH sound like in chicken coop) the letter stays the same it just has "bonus information"; in German an umlaut version of a letter (ö versus o) is considered an entirely different letter, with a different pronunciation and alphabet order (though further complicated by conversions to digraphs in some situations such as ö to oe).

Which language is "right"? The one that thinks that diaresis is merely a modifier or the one that thinks of an accented letter as a different letter from the unmodified? There isn't a right and wrong here, there's just different perspectives, different philosophies, huge histories of language evolution and divergence, and lots of people reusing similar looking concepts for vastly different needs.

Similarly the Spanish ñ is single letter to Spanish but the ~ accent may be a tone marker in another language that is important to the pronunciation of the word and a modifier to the letter rather a letter on its own.

There's the case of the overlaps where different alphabets diverged from similar origins. Are the letters that still look alike the same letters? [1]

Math is a language with a merged alphabet of latin characters, arabic characters, greek characters, monastery manuscript-derived shorthands, etc. Is the modern Greek Pi the same as the mathematical symbol Pi anymore? Do they need different representations? Do you need to distinguish, say in the context of modern Greek mathematical discussions the usage of Pi in the alphabet versus the usage of mathematical Pi?

These are just the easy examples in the mostly EFIGS space most of HN will be aware of. Multiply those sorts of philosophical complications across the spectrum of languages written across the world, the diversity of Asian scripts, and the wonder of ancient scripts, and yes the modern joy of emoji. Even "normalization" is a hack where you don't care about the philosophical meaning of a symbol, you just need to know if the symbols vaguely look alike, and even then there are so many different kinds of normalization available in Unicode because everyone can't always agree which things look alike either, because that changes with different perspectives from different languages.

[1] An excellent Venn diagram: https://en.wikipedia.org/wiki/File:Venn_diagram_showing_Gree...

Some languages use multiple marks on a single base character. The combining character system is more flexible, uses fewer code points, and doesn't require lookup tables to search or sort by base character.
In practice UTF-8 has done more to enable the wrong thing, rather than forcing programmers to do the right thing.

> You can't really index Unicode characters like ASCII strings

But then why do strings-are-UTF8 languages like Go or D make it so easy? Why optimize your syntax with `len(txt)` and `txt[0]` when, as you point out, you can't do that? Why make it trivial to split code points or composed character sequences, but doing something like proper string truncation is brutally hard?

UTF-8's fail-fast property has not enabled more Unicode-savviness. Instead it just lets programmers pretend that we still are in the land of C strings.

> Why optimize your syntax with `len(txt)` and `txt[0]` when, as you point out, you can't do that?

I like Rust's approach. It's a strings-are-UTF8 language but strings (both str and String):

- are not directly indexable

- force you to be explicit when iterating: you iterate over either `s.chars()` or `s.bytes()`

- are called out in the docs as being a vector of unsigned 8-bit integers internally

- support a len() method that is called out as returning the length of that vector

- can be sliced if you reaaaally need to get around inability to index directly but attempting to slice in the middle of a character causes a panic

> - support a len() method that is called out as returning the length of that vector

They should have called that one bytelen() then.

And how do you get a proper offset for slicing? Do you then have to interpret the UTF-8 bytes yourself, or can you somehow get it via the chars() iterator or something similar?

Yeah this is the way to go for sure.
> But then why do strings-are-UTF8 languages like Go

To clarify: strings in Go are not necessarily UTF-8. String literals will be, because the source code is defined to be UTF-8, but strings values in Go can contain any sequence of bytes: https://blog.golang.org/strings

Note that this prints 2, because the character contains two bytes in UTF-8, even though the two bytes correspond to one codepoint: https://play.golang.org/p/BqGzW1O2WX

Go also has the concept of a rune, which is separate from a byte and a string, and makes this easier when you're working with raw string encodings.

This makes it sound like Go is even more confused. If strings in Go are not necessarily UTF-8, why does the strings package assume UTF-8, `for range` assumes UTF-8, etc?
> If strings in Go are not necessarily UTF-8, why does the strings package assume UTF-8, `for range` assumes UTF-8, etc?

The blog post I linked to explains this in more detail, but in short: the `strings` package provides essentially the same functions as the `bytes` package does, except applied to work on UTF-8 strings. There are other packages for dealing with other text encodings.

The `for range` syntax is the one "special case", and it was done because the alternative (having it range over bytes instead of codepoints) is almost never desirable in practice[0], and it's easier to manually iterate the few times you do need it than it it would be to import a UTF-8 package just to iterate over a string 99.9% of the time.

[0] iterating over bytes is done all the time, of course, but usually at that point you're dealing with an actual slice of bytes already that you want to iterate over, not a string.

The point is that Go lumps together byte arrays and strings. It's a common flaw, but it's really unfortunate to see it perpetrated in a language that was designed after this lesson was already learned.

A byte array is a representation of a string, for sure. But strings themselves are higher-level abstractions. It shouldn't be that easy to mix the two.

An equivalent situation would be if integers were byte arrays. So len(x) would give you 4, for example, and you could do x[0], x[1] etc - except you would almost never actually do that in practice, and occasionally you'd end up doing the wrong thing by mistake.

If any language actually worked that way, everyone would be up in arms about it. Unfortunately, the same passes for strings, because of how conditioned we are to treat them as byte sequences.

Calling it "char" in C was probably the second million dollar mistake in the history of PL design, right after null.

> I think UTF-8 is generally the wrong way to expose Unicode to a programmer, since it lets them think they can keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend.

The solution to that is simple, don't let the programmer access individual bytes in a Unicode string.

Get rid of indexing into them and replace it with iterators. Make string handling functions work on code points at the very least but better yet on grapheme clusters. There's a little more to it than that but it's a good start.

Yes, people are still stuck in the ASCII mindset and can't seem to get away from thinking in bytes. But I belive it's the ability to index into strings is what's to blame and not the encoding used.

Agreed, assuming O(1) lookup of anything inside a string only leads to bad encoding bugs. UTF-8 everywhere, no exceptions.

You can never assume any user-visible character will align evenly with any byte boundary, even if you're using UTF-32. Composed characters throw that assumption out the window, as well as dozens of other unicode quirks I can't recall now.