

by Avernar 3209 days ago

> Since the high-level API is supposed to let you treat a string as a sequence of code points,

I disagree with that premise. It should operate on grapheme clusters. Operating on code points falls into the same trap as operating on bytes.
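To illustrate the trap, here is a minimal sketch using only the standard library. The sample strings are my own choices, not anything from the thread. Each one renders as a single user-perceived character, yet Python's `len`, slicing, and reversal all work on code points.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: one visible character, two code points.
decomposed = "e\u0301"

# MAN + ZWJ + WOMAN + ZWJ + GIRL: one family emoji, five code points.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

code_points_e = len(decomposed)
code_points_family = len(family)
utf8_bytes_family = len(family.encode("utf-8"))

# Reversing by code point moves the accent in front of its base letter.
reversed_decomposed = decomposed[::-1]

# NFC normalization composes this particular pair into a single code point,
# but no precomposed form exists for the emoji sequence.
composed = unicodedata.normalize("NFC", decomposed)
family_nfc = unicodedata.normalize("NFC", family)

print(code_points_e, code_points_family, utf8_bytes_family)
```

Segmenting by grapheme cluster needs the rules from Unicode Standard Annex #29, which the standard library does not implement, so this sketch only shows where code-point operations go wrong.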

> a correct implementation (which Python didn't have until 3.3!) would've imposed the overhead of conversion to something resembling a fixed-width encoding whenever a programmer invoked certain operations.

Those operations should have been removed. Indexing is the big one that needs a fixed-width internal representation for speed. Code could have been rewritten to not require indexing. But mechanical translation from Python 2 to 3 was a goal, and because of that they couldn't radically change the Unicode API for the better.
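A rough sketch of why indexing wants a fixed width: with a variable-width encoding such as UTF-8, finding the n-th code point means scanning from the start. The helper name `code_point_offset` is hypothetical and exists only for illustration. It is not how CPython stores or indexes strings.

```python
def code_point_offset(data: bytes, index: int) -> int:
    """Return the byte offset where code point number `index` starts.

    `data` is assumed to be valid UTF-8. The scan is O(n) in the length of
    `data`, whereas a fixed-width layout could compute the offset directly
    as index * width.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts
        # a new code point.
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


# 'a' is 1 byte, 'é' is 2 bytes, the emoji is 4 bytes, 'z' is 1 byte.
sample = "a\u00e9\U0001F600z".encode("utf-8")
emoji_offset = code_point_offset(sample, 2)
z_offset = code_point_offset(sample, 3)
```

Code that iterates over a string rather than jumping to arbitrary positions never pays this scan cost, which is the kind of rewrite meant above.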

> And the vast majority of strings in real-world Python contain only code points also present in latin-1, which means they can be stored in one byte per code point with this approach. And for strings which can't be stored in one byte per code point, you were similarly going to pay the price sooner or later.

You're going to pay the price of 4 bytes per code point quite often. A single emoji will blow up a latin-1 string to 4 times the size.
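This is easy to check on CPython 3.3 or later, where the flexible string representation applies. The sketch below assumes CPython. `sys.getsizeof` includes a per-object header that varies between versions, so it computes a ratio instead of quoting exact byte counts. The 1000-character string is an arbitrary example.

```python
import sys

# 1000 copies of 'é' (U+00E9): every code point fits in latin-1,
# so CPython can store the string with 1 byte per code point.
latin1_text = "\u00e9" * 1000

# Appending one emoji (U+1F600, outside the Basic Multilingual Plane)
# forces the whole string into 4 bytes per code point.
with_emoji = latin1_text + "\U0001F600"

size_before = sys.getsizeof(latin1_text)
size_after = sys.getsizeof(with_emoji)
ratio = size_after / size_before

print(size_before, size_after, round(ratio, 2))
```

The ratio lands a little under 4 because the fixed header is counted in both sizes.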