by cmccabe 4948 days ago
I am responding to your earlier post which announced that UCS2 is better than UTF8 internally because it counts unicode characters faster than UTF8. Hopefully now you understand that just taking the number of UCS2 bytes and dividing by 2 does not give you the number of letters.

Just in case you don't, let's walk through it again.

UTF-16 big-endian representation of Ä:

0x00 0x41 0x03 0x08

Another UTF-16 big-endian representation of Ä:

0x00 0xc4

If you look at the number of bytes, the first example has 4. It represents one letter. The second example has 2. It also represents one letter. Conclusion: UCS2 does not "count unicode characters faster than UTF8." You still have to look at every byte to see how many letters you have, same as in UTF-8.
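To make the two byte sequences above concrete, here is a minimal Python sketch. It assumes only the standard library (`unicodedata` and the `utf-16-be` codec); the variable names are made up for illustration.

```python
import unicodedata

# The two big-endian byte sequences quoted above.
decomposed = bytes([0x00, 0x41, 0x03, 0x08]).decode("utf-16-be")  # 'A' + U+0308 (combining diaeresis)
precomposed = bytes([0x00, 0xC4]).decode("utf-16-be")             # U+00C4

# Byte lengths of each form, and the naive "bytes divided by 2" count.
byte_lengths = (len(decomposed.encode("utf-16-be")),
                len(precomposed.encode("utf-16-be")))
naive_counts = tuple(n // 2 for n in byte_lengths)

# After NFC normalization both forms are the same single code point.
same_letter = unicodedata.normalize("NFC", decomposed) == precomposed
```

The naive counts differ even though both strings render as the same letter, which is the point being made.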

Do you grasp this? If not, maybe you are one of those "ascii-centric ignorant morons" I keep hearing so much about.


Name one Unicode implementation which shows utf16 `0x00 0x41 0x03 0x08` as length 1.

U+0041 U+0308 is two code points by definition. Thus length == 2.

Yes? `System.Globalization` or `ICU` can count graphemes, what's your point?

Those libraries are equivalent to normalize( utf16 `0x00 0x41 0x03 0x08`) == length 1

Back to my top comment, I stated that UCS2 counts faster than UTF-8 internally, because every BMP code point is just two bytes. What's wrong here? If variable-length is so good, why is py3k using UCS-4 internally? (Which means every character is exactly 32 bits. There, I said character again.)

> Back to my top comment, I stated that UCS2 counts faster than UTF-8 internally

The part cmccabe tries to explain, and which you repeatedly fail to understand, is that UCS2 counts Unicode code points faster than UTF-8, which is completely useless because "characters" (what the end user sees as a single sub-unit of text) often span multiple code points, so counting code points is essentially a recipe for broken code and nothing else.
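A rough sketch of the difference between counting code points and counting what the user sees. Python's standard library has no grapheme segmentation, so the hypothetical helper `approx_grapheme_count` below just skips combining marks. That is an assumption for illustration only, not real grapheme cluster segmentation as ICU does it.

```python
import unicodedata

def approx_grapheme_count(text: str) -> int:
    """Count code points that are not combining marks.

    Only a rough stand-in for user-perceived characters: it ignores
    emoji sequences, Hangul jamo, and other cases that proper
    grapheme cluster segmentation handles.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

s = "A\u0308"                       # 'A' followed by combining diaeresis
code_points = len(s)                # what a code point count reports
approx = approx_grapheme_count(s)   # closer to what the reader sees
```

Either way the string has to be scanned; the fixed-width encoding only makes the less useful number cheap.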

> If variable-length is so good why py3k is using UCS-4 internally?

It's not. Before 3.3 it used either UCS2 or UCS4 internally (fixed at build time), as did Python 2. Since Python 3.3 it picks the internal representation per string, depending on the widest code point the string contains.

> Which means every character is exactly 32 bits. There, I said character again.

yeah and you're wrong again.

> yeah and you're wrong again.

http://en.wikipedia.org/wiki/UTF-32

UTF-32 (or UCS-4) is a protocol to encode Unicode characters that uses exactly 32 bits per Unicode code point.

http://www.unicode.org/faq/utf_bom.html#utf32-1
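For what "exactly 32 bits per Unicode code point" means in practice, here is a small Python check using only the built-in codecs. The sample strings are arbitrary picks for illustration.

```python
# One ASCII letter, one precomposed letter, one non-BMP emoji,
# and one letter built from a base plus a combining mark.
samples = ["A", "\u00c4", "\U0001F600", "A\u0308"]

code_points = [len(s) for s in samples]
utf32_sizes = [len(s.encode("utf-32-be")) for s in samples]
utf16_sizes = [len(s.encode("utf-16-be")) for s in samples]

# UTF-32 is always 4 bytes per code point ...
fixed_width = all(size == 4 * n for size, n in zip(utf32_sizes, code_points))
```

The last sample is one visible letter but takes 8 bytes in UTF-32, so the fixed width is per code point, not per user-perceived character. The emoji takes 4 bytes in UTF-16 (a surrogate pair), so two bytes per code point only holds inside the BMP.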