by est
4948 days ago
Yes? `System.Globalization` or `ICU` can count graphemes, what's your point? Those libraries are equivalent to normalize( utf16 `0x00 0x41 0x03 0x08`) == length 1.
Back to my top comment, I stated that UCS2 counts faster than UTF-8 internally, because every BMP code point is just two bytes, what's wrong here?
If variable-length is so good why py3k is using UCS-4 internally? (Which means every character is exactly 32 bits. There, I said character again.)
The part cmccabe tries to explain, and which you repeatedly fail to understand, is that UCS2 counts Unicode code points faster than UTF-8, which is completely useless because "characters" (what the end-user sees as a single sub-unit of text) often span multiple code points, so counting code points is essentially a recipe for broken code and nothing else.
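To make the distinction concrete, here is a rough Python sketch using the same `A` + combining diaeresis sequence (`0x0041 0x0308`) mentioned above. The helper names are made up for illustration, and `approx_grapheme_count` is only an approximation that skips combining marks; it is not a real grapheme segmenter like ICU's.

```python
import unicodedata


def count_code_points(s: str) -> int:
    # In Python 3, len() on a str counts code points.
    return len(s)


def utf16_code_units(s: str) -> int:
    # Each UTF-16 code unit is 2 bytes; astral code points need two units.
    return len(s.encode("utf-16-le")) // 2


def approx_grapheme_count(s: str) -> int:
    # Hypothetical helper: counts code points that are not combining marks.
    # This is NOT full grapheme cluster segmentation, just an illustration.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


decomposed = "A\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

print(count_code_points(decomposed))      # code points in the decomposed form
print(approx_grapheme_count(decomposed))  # what the user perceives, roughly
print(count_code_points(composed))        # code points after NFC normalization
print(utf16_code_units("\U0001F600"))     # a non-BMP code point in UTF-16
```

Normalization only helps here because a precomposed form happens to exist; plenty of sequences have none, so the code point count and the perceived character count still diverge.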
> If variable-length is so good why py3k is using UCS-4 internally?
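It's not. Before 3.3 it used either UCS2 or UCS4 internally, depending on how the interpreter was built, as did Python 2. Since Python 3.3 it switches the internal encoding on the fly, picking 1, 2 or 4 bytes per code point for each string based on the widest code point in it.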
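You can observe this from the outside, assuming CPython 3.3 or later. The sketch below measures how much a string grows per extra code point; `bytes_per_code_point` is a made-up helper, and `sys.getsizeof` reports an implementation detail, so treat the numbers as CPython-specific.

```python
import sys


def bytes_per_code_point(ch: str, n: int = 1000) -> int:
    # Hypothetical helper: growth in object size when one more copy of
    # `ch` is appended. Assumes CPython >= 3.3 string storage.
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


ascii_width = bytes_per_code_point("a")            # ASCII-only string
bmp_width = bytes_per_code_point("\u0100")         # BMP, outside Latin-1
astral_width = bytes_per_code_point("\U00010000")  # outside the BMP

print(ascii_width, bmp_width, astral_width)
```

So a string is only stored at 32 bits per code point when it actually contains something outside the BMP.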
> Which means every character is exactly 32 bits. There, I said character again.
Yeah, and you're wrong again.