Hacker News
by hsivonen 3822 days ago
I see that it's irksome, but as someone who works on the Web Platform, which takes backward compat seriously, I tend to view Python 3 as a mistake. I'm still hoping that they make a Python 5 that's compatible with Python 2.7 programs but otherwise brings in new features. I'm not holding my breath, though.

The saddest thing about Python 3 is that they made a breaking change to do Unicode "right" and still did it wrong. The right way to do Unicode is the way Rust does it: UTF-8 in memory and no (safe) API to introduce UTF-8 invalidity.
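One plausible reading of "still did it wrong" (an assumption about what is meant here, not something stated above) is that a Python 3 `str` can hold lone surrogates, so safe code can build a string that has no valid UTF-8 encoding. A minimal sketch; `can_encode_utf8` is a helper name made up for this example:

```python
def can_encode_utf8(s: str) -> bool:
    """Return True if the string has a valid UTF-8 encoding."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# A lone surrogate is accepted as a str of length 1 ...
lone = "\ud800"

# ... and the standard "surrogateescape" error handler produces them
# from undecodable bytes.
escaped = b"\xff".decode("utf-8", errors="surrogateescape")

print(can_encode_utf8("héllo"))   # ordinary text
print(can_encode_utf8(lone))      # lone surrogate
print(can_encode_utf8(escaped))   # escaped byte 0xFF
```

Only the first call succeeds; the other two strings exist as `str` values but cannot be written out as UTF-8.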

UTF-32 is wrong, because it's wasteful and still doesn't accomplish what people naively expect due to grapheme clusters potentially taking more than one UTF-32 code unit.
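To make the grapheme-cluster point concrete, here is a small sketch using only the standard library. The sample strings are my own picks, not from the thread:

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: displayed as a single "é".
decomposed = "e\u0301"

# Two regional indicator symbols: displayed as a single flag.
flag = "\U0001F1EB\U0001F1F7"

# len() counts code points, which is what UTF-32 code units are.
decomposed_code_points = len(decomposed)
flag_code_points = len(flag)

# Each code point costs four bytes in UTF-32 (no BOM with the -le codec).
decomposed_utf32_bytes = len(decomposed.encode("utf-32-le"))

# NFC happens to merge this pair into one code point, but there is no
# precomposed form for the flag, so normalization does not help in general.
composed = unicodedata.normalize("NFC", decomposed)
flag_after_nfc = unicodedata.normalize("NFC", flag)
```

So one user-perceived character can be two UTF-32 code units (8 bytes) here, and fixed-width code points still don't give one index per "character".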

2 comments

CPython (3.3 and later) stores a string with one byte per code point by default (Latin-1, not UTF-8) and only widens to two or four bytes per code point (UCS-2 / UCS-4) when the characters in the string require it.
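The widening can be observed from Python itself. This is a rough sketch that assumes CPython 3.3 or later, where it is an implementation detail; other implementations may store strings differently. `bytes_per_code_point` is a helper name made up for the example:

```python
import sys

def bytes_per_code_point(ch: str) -> int:
    """Estimate per-character storage by growing a string by one character."""
    return sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)

# Sample characters chosen to hit each storage width (CPython-specific).
samples = {
    "ascii": "a",             # U+0061
    "latin1": "\u00e9",       # U+00E9
    "bmp": "\u0100",          # U+0100, first code point above Latin-1
    "astral": "\U0001F600",   # U+1F600, outside the Basic Multilingual Plane
}

widths = {name: bytes_per_code_point(ch) for name, ch in samples.items()}
print(widths)
```

On CPython the narrowest form costs one byte per code point, and a single character above U+00FF or U+FFFF makes the whole string use two or four bytes per code point.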

> UTF-32 is wrong, [because] it's wasteful and still doesn't accomplish what people naively expect due to grapheme clusters potentially taking more than one UTF-32 code unit.

Out of curiosity, is there a correct way to encode Unicode that doesn't involve this level of surprise? I thought that this was still an unsolved problem at this point.

You get people to accept the truth that characters have a variable length in bytes.

Then you offer a data structure that lets you perform O(1) or O(log n) operations on sequences of single-character strings.

If it's read-only you could make it just be an index, blah blah, the details don't matter a lot. The point is you can make something that's both correct with respect to grapheme clusters and probably more space-efficient than UTF-32 despite the extra data.

And then the encoding inside the character strings isn't particularly important, but might as well use UTF-8.
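A minimal sketch of the read-only "index over UTF-8" idea, under a loud assumption: cluster boundaries are approximated as "a code point with combining class 0 starts a new cluster", which is much cruder than real grapheme segmentation (it gets emoji sequences, Hangul jamo, and others wrong). `ClusterString` is a name made up for this example:

```python
import unicodedata

class ClusterString:
    """Read-only text with O(1) indexing by (approximate) grapheme cluster."""

    def __init__(self, text: str) -> None:
        data = bytearray()
        starts = []  # byte offset where each cluster begins
        for ch in text:
            # ASSUMPTION: combining marks attach to the previous cluster;
            # everything else starts a new one. Not full segmentation.
            if not starts or unicodedata.combining(ch) == 0:
                starts.append(len(data))
            data += ch.encode("utf-8")
        self._data = bytes(data)
        self._starts = starts

    def __len__(self) -> int:
        return len(self._starts)

    def __getitem__(self, i: int) -> str:
        n = len(self._starts)
        if i < 0:
            i += n
        if not 0 <= i < n:
            raise IndexError("cluster index out of range")
        start = self._starts[i]
        end = self._starts[i + 1] if i + 1 < n else len(self._data)
        # Slice the UTF-8 bytes for one cluster and decode only that slice.
        return self._data[start:end].decode("utf-8")

s = ClusterString("e\u0301a\u0308\u0323z")
print(len(s), [s[i] for i in range(len(s))])
```

Each element comes back as a short string, matching the "sequences of single-character strings" framing. A plain list of offsets is used for clarity; a packed integer array would be the obvious choice if space matters.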

-

Either that or make yourself a hilariously inefficient format based on:

UAX15-D3. Stream-Safe Text Format: A Unicode string is said to be in Stream-Safe Text Format if it would not contain any sequences of non-starters longer than 30 characters in length when normalized to NFKD.

Who's with me on 128-byte characters?
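For the curious, the quoted definition can be checked with the standard library. This is a sketch that follows the wording of UAX15-D3 above, treating a non-starter as a code point whose canonical combining class is nonzero. The function names are made up here, and the 128-byte arithmetic is my guess at where that number comes from:

```python
import unicodedata

MAX_NON_STARTERS = 30  # limit from the quoted UAX15-D3 definition

def longest_non_starter_run(text: str) -> int:
    """Longest run of non-starters in the NFKD form of the text."""
    longest = run = 0
    for ch in unicodedata.normalize("NFKD", text):
        # A non-starter has a nonzero canonical combining class.
        run = run + 1 if unicodedata.combining(ch) != 0 else 0
        longest = max(longest, run)
    return longest

def is_stream_safe(text: str) -> bool:
    return longest_non_starter_run(text) <= MAX_NON_STARTERS

# Presumed origin of "128 bytes": one starter plus 30 non-starters,
# four bytes each as UTF-32, rounded up to a power of two.
worst_case_bytes = (1 + MAX_NON_STARTERS) * 4

print(is_stream_safe("a" + "\u0301" * 30))  # at the limit
print(is_stream_safe("a" + "\u0301" * 31))  # one past the limit
print(worst_case_bytes)
```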

People should start calling the hypothesized next version Python 2.great.