by wrs
3213 days ago
This was a pretty gutsy move on Python's part. The presence of a single emoji in an English string will blow up memory usage for the whole string by 4x. And because graphemes don't map 1:1 to code points, the O(1) indexing and length operations you bought with that trade-off will still confuse people who don't understand Unicode.
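For anyone who wants to see both effects, here is a rough sketch. It assumes CPython 3.3 or later (the flexible string representation); the variable names are just for illustration, and the exact byte counts vary by version and platform, so it only compares sizes rather than quoting numbers.

```python
import sys

# 1000 ASCII characters: CPython stores these at 1 byte per code point.
ascii_text = "a" * 1000

# Append a single emoji (U+1F600, outside the BMP). The whole string
# now has to be stored at 4 bytes per code point.
with_emoji = ascii_text + "\U0001F600"

ascii_size = sys.getsizeof(ascii_text)
emoji_size = sys.getsizeof(with_emoji)
print(ascii_size, emoji_size)

# Indexing and len() are O(1), but they count code points, not what a
# reader would call characters. Thumbs-up plus a skin-tone modifier
# renders as one symbol but is two code points.
thumbs_up = "\U0001F44D\U0001F3FD"
print(len(thumbs_up))        # code points, not user-perceived characters
print(thumbs_up[0] == "\U0001F44D")  # index 0 is only the base emoji
```

So the indexing is fast, but what you index into is still not "characters" in the sense most people expect.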
Though I also think the struggle is mostly due to people being stuck in an everything-is-like-ASCII mindset. I didn't get into that, but it's one big reason why I think UTF-8 is generally the wrong way to expose Unicode to a programmer: it lets them keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend.
Personally I'd like everyone to just actually learn at least the things about Unicode that I went into here (such as why "one code point == one character" is a wrong assumption), and I think that'd alleviate a lot of the pain. I also avoided talking much about normalization, because too many people hear about it and decide they can just normalize to NFKC and go back to assuming code point/character equivalence post-normalization.
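To make the normalization point concrete, here is a small sketch using only the standard library's `unicodedata` module. The sample strings are my own picks for illustration, not anything from the article.

```python
import unicodedata

# "e" + combining acute accent: two code points, one visible character.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFKC", decomposed)
print(len(decomposed), len(composed))  # normalization helps in this case

# Even the composed form is more than one byte in UTF-8, so the
# "one byte == one character" assumption already fails here.
utf8_length = len(composed.encode("utf-8"))

# A flag is a pair of regional indicator code points. There is no
# precomposed form, so NFKC leaves it as two code points.
flag = "\U0001F1FA\U0001F1F8"
flag_nfkc = unicodedata.normalize("NFKC", flag)
print(len(flag_nfkc))

# NFKC can also change the text itself: the "fi" ligature (U+FB01)
# becomes the two letters "f" and "i".
ligature_nfkc = unicodedata.normalize("NFKC", "\uFB01")
print(ligature_nfkc)
```

Normalization makes equivalent strings compare equal; it doesn't give you back a world where one code point is one character.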