by raiph 1762 days ago

> The fact that they went out of their way to break python 2 unicode when running on python 3 was just totally nuts. Especially after making such a big deal about unicode!

Imo it's infinitely worse than that.

The big deal about Unicode is its nature, as defined in the "Summary Narrative" from 1991[0]. To wit:

> The Unicode character encoding derives its name from three main goals:
>
> * universal (addressing the needs of world languages)
> * uniform (fixed-width codes for efficient access), and
> * unique (bit sequence has only one interpretation into character codes)

The Unicode folk realized that it would take decades to shift developers worldwide to doing that properly, so they adopted a three-stage plan for software (e.g. the string types of programming languages) to get from where things were to where they needed to be:

* Stage #1: Character = byte

* Stage #2: Character = code point

* Stage #3: Character = what a user thinks of as a character[1]
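
To make the three stages concrete, here is a minimal Python 3 sketch that counts the "characters" in one string three ways. Treat the Stage #3 count as an assumption-laden approximation: it only folds combining marks into the preceding code point via the standard library's `unicodedata.combining`, which is not the full grapheme cluster segmentation of UAX #29 (it would miscount emoji ZWJ sequences, among other things). The helper name `approx_user_chars` is hypothetical.

```python
import unicodedata


def approx_user_chars(text: str) -> int:
    """Crude Stage #3 count: code points that are not combining marks.

    Assumption: this only approximates user-perceived characters; real
    grapheme segmentation follows Unicode UAX #29 and needs more rules.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# "e" followed by U+0301 COMBINING ACUTE ACCENT: a reader sees one character.
s = "e\u0301"

stage1 = len(s.encode("utf-8"))  # Stage #1: count bytes (UTF-8 chosen here)
stage2 = len(s)                  # Stage #2: count code points
stage3 = approx_user_chars(s)    # Stage #3: approximate user-perceived characters

print(stage1, stage2, stage3)    # 3 2 1
```

Same text, three different answers to "how long is it?", depending on which stage the string type lives in.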

Python 1 was a Stage #1 language -- Character = byte -- like most others of its time.

In Python 2 there were tweaks to try to move toward Stage #2 -- Character = code point -- again, like most other PLs of its time.

In Python 3, they dictated a full switch to Stage #2 -- Character = code point. That was an unnecessarily painful break relative to Python 2. But -- and this is what really matters -- they entirely ignored Stage #3, which is the whole point of Unicode in the final analysis.
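
A small sketch of what stopping at Stage #2 looks like in practice, using only Python 3's built-in `str` operations and `unicodedata.normalize`. The sample string is my own choice for illustration, not something taken from the Unicode documents.

```python
import unicodedata

decomposed = "cafe\u0301"   # "café" as e + U+0301 COMBINING ACUTE ACCENT
precomposed = "caf\u00e9"   # "café" with U+00E9, a single code point

# Same text to a reader, different code point sequences to str.
same_raw = decomposed == precomposed                                # False
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed  # True

# len() and slicing work per code point, so they can split a character.
lengths = (len(decomposed), len(precomposed))  # (5, 4)
truncated = decomposed[:4]                     # "cafe": the accent is cut off

# Reversing per code point detaches the accent from its base letter.
reversed_text = decomposed[::-1]               # "\u0301efac"

print(same_raw, same_nfc, lengths, truncated)
```

Normalization papers over this particular case, but many user-perceived characters have no precomposed form, so it is not a general fix.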

[0] https://www.unicode.org/history/summary.html

[1] https://unicode.org/glossary/#grapheme