by ubernostrum
3211 days ago
As I said in the article, I think the overhead of adding yet more weirdness in the form of quirks of the internal encoding (which could vary according to how the Python interpreter was compiled!) is a bad thing to do on top of how much people seem to struggle mentally just to get Unicode all on its own. Though I also think the struggle is mostly due to people being stuck in an everything-is-like-ASCII mindset, and though I didn't get into that, it's one big reason why I think UTF-8 is generally the wrong way to expose Unicode to a programmer, since it lets them think they can keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend. Personally I'd like everyone to just actually learn at least the things about Unicode that I went into here (such as why "one code point == one character" is a wrong assumption), and I think that'd alleviate a lot of the pain. I also avoided talking much about normalization, because too many people hear about it and decide they can just normalize to NFKC and go back to assuming code point/character equivalence post-normalization.
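To make the normalization point concrete, here is a rough Python sketch (the helper name `nfkc_len` is made up for illustration, and the three inputs are just examples I picked). It only shows that NFKC leaves some user-perceived characters spread over several code points; it is not a complete survey of the cases.

```python
import unicodedata

def nfkc_len(text: str) -> int:
    """Number of code points left after NFKC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFKC", text))

# "e" + COMBINING ACUTE ACCENT has a precomposed form, so it collapses.
print(nfkc_len("e\u0301"))                # 1
# "q" + COMBINING ACUTE ACCENT has no precomposed form, so it stays split.
print(nfkc_len("q\u0301"))                # 2
# A flag is two regional indicator symbols; normalization does not merge them.
print(nfkc_len("\U0001F1EB\U0001F1F7"))   # 2
```

So normalization helps with comparison, but it does not bring back "one code point == one character".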
|
Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.
I actually like UTF-8 because it will break very quickly, and force the programmer to do the right thing. The first time you hit é or € or an emoji, you'll have a multibyte character, and you'll need to deal with it.
All the other options will also break, but later on:
- If you use UTF-16, then é and € will work, but most emoji sit outside the Basic Multilingual Plane and will still result in surrogate pairs.
- If you use a 4-byte representation, then you'll be able to treat most emoji as single characters. But then somebody will build é from two separate code points as "e + U+0301 COMBINING ACUTE ACCENT", or you'll run into a flag or skin tone emoji, and once again, you're back at square one.
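A quick Python sketch of where each representation stops being "one unit per character". The function name `unit_counts` and the sample strings are my own picks for illustration; Python's `len()` on a `str` counts code points, which stands in for the 4-byte case here.

```python
def unit_counts(text: str) -> tuple[int, int, int]:
    """Return (UTF-8 bytes, UTF-16 code units, code points) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    utf16_units = len(text.encode("utf-16-le")) // 2  # 2 bytes per code unit
    code_points = len(text)
    return utf8_bytes, utf16_units, code_points

samples = {
    "e-acute, precomposed": "\u00e9",
    "euro sign": "\u20ac",
    "grinning face emoji": "\U0001F600",
    "e + combining acute": "e\u0301",
    "flag (two regional indicators)": "\U0001F1EB\U0001F1F7",
    "thumbs up + skin tone modifier": "\U0001F44D\U0001F3FD",
}

for label, text in samples.items():
    print(label, unit_counts(text))
# e-acute, precomposed (2, 1, 1)
# euro sign (3, 1, 1)
# grinning face emoji (4, 2, 1)
# e + combining acute (3, 2, 2)
# flag (two regional indicators) (8, 4, 2)
# thumbs up + skin tone modifier (8, 4, 2)
```

Each sample is one character to a reader. UTF-8 gives a count above 1 on the very first row, UTF-16 on the third, and code points on the fourth.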
You can't really index Unicode characters like ASCII strings. Written language is just too weird for that. But if you use UTF-8 (with a good API), then you'll be forced to accept that "str[3]" is hopeless very quickly. It helps a lot if your language has separate types for "byte" and "Unicode codepoint", however, so you can't accidentally treat a single byte as a character.
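As one illustration of the separate-types point, Python 3 keeps `bytes` and `str` apart, so here is a small sketch of what that buys you. The helper `cut_is_valid` is a name I made up for this example, not a standard function.

```python
text = "na\u00efve"          # "naive" with a diaeresis; U+00EF is one code point
data = text.encode("utf-8")  # bytes, a different type from str

def cut_is_valid(raw: bytes, n: int) -> bool:
    """Return True if the first n bytes of raw decode as UTF-8 on their own."""
    try:
        raw[:n].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(len(text), len(data))                # 5 6
# Indexing bytes yields an int, not a character, so it is hard to confuse the two.
print(data[2], type(data[2]).__name__)     # 195 int
# Cutting after 3 bytes splits the two-byte sequence and fails loudly.
print(cut_is_valid(data, 3))               # False
print(cut_is_valid(data, 4))               # True

# Mixing the two types is a TypeError rather than silent corruption.
try:
    data + text
except TypeError:
    print("bytes and str do not mix")
```

The failure shows up on the first non-ASCII character, which is the "break very quickly" behavior I'm arguing for.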