
http://nodejs.org/api/buffer.html

yeah sure why not.

    >>> u'𡘓'[0:1]
    u'\U00021613'

    >>> u'Hi, Mr𡘓'[-1]
    u'\U00021613'

    >>> u'𠀋'[0:1]
    u'\U0002000b'

JavaScript won't work because the JS engine uses UCS2, duh.

Actually, JavaScript mixes up Unicode strings and binary strings; that's why Node.js invented Buffer.
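For anyone following along, here is a rough Python 3 sketch of why 16-bit code units are a problem for the characters above. It assumes the engine stores strings as UTF-16 code units; `utf16_units` is a made-up helper name, not part of any library.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

ch = "\U00021613"  # the first character from the examples above, outside the BMP
units = utf16_units(ch)

print(len(ch))                  # code points in the string
print(len(units))               # 16-bit code units needed to store it
print([hex(u) for u in units])  # a high surrogate followed by a low surrogate
```

Slicing such a string by code unit, which is what a UCS2-style index does, can cut the pair in half.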

codeka 4995 days ago

You've moved the goalposts:

  u'\U00021613'

This is a UTF-32 code unit, not a UTF-16 code unit. Even UTF-32 doesn't help when you have combining characters. I suggest you read dietrichepp's post again; he's talking about Normalization Form D.

Okay, if it's an explicit combining character, what's wrong with counting the character parts explicitly?

You know normalized form is the norm, right?

There are four different normalized forms in Unicode. Maybe you should enlighten us about which one you're talking about.

Or just stop embarrassing yourself.

Reading all of your comments, are you suggesting a Unicode object should not have len() or substring()?

A standard like that is totally not embarrassing.

Code points aren't letters.

Consider the following sequence of code points: U+0041 U+0308 [edit: corrected sequence]

That equals this European letter: Ä

Two code points, one letter. MAGIC! You can also get the same-looking letter with a single code point using U+00C4 (Unicode likes redundancy).
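A quick Python 3 check of the two spellings, as a sketch using only the standard `unicodedata` module (the variable names are just for illustration):

```python
import unicodedata

decomposed = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS
precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS

print(len(decomposed), len(precomposed))  # 2 1
print(decomposed == precomposed)          # False: different code point sequences

# NFC composes, NFD decomposes; after normalizing, the two spellings compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)  # True True
```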

Not all languages have letters. Not all languages that have letters represent each one with a single code point. Please think twice before calling people "morons."

> Two code points, one letter.

Yes, I understand there are a million ways to display the same shape using various Unicode code points. But how does that make code point counting impossible?

And if you are explicitly using COMBINING DIAERESIS instead of the single U+00C4, counting the diaeresis separately is wrong somehow?

Why don't we make a law stating that both ae and æ are a single letter?

I am responding to your earlier post which announced that UCS2 is better than UTF8 internally because it counts unicode characters faster than UTF8. Hopefully now you understand that just taking the number of UCS2 bytes and dividing by 2 does not give you the number of letters.

Just in case you don't, let's walk through it again.

UTF-16 big-endian representation of Ä:

0x00 0x41 0x03 0x08

Another UTF-16 big-endian representation of Ä:

0x00 0xc4

If you look at the number of bytes, the first example has 4. It represents one letter. The second example has 2. It also represents one letter. Conclusion: UCS2 does not "count unicode characters faster than UTF8." You still have to look at every byte to see how many letters you have, same as in UTF-8.
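The byte counts are easy to reproduce. This is a small Python 3 sketch, with `as_hex` being a made-up helper for printing:

```python
def as_hex(data):
    """Format bytes as space-separated hex pairs (hypothetical helper)."""
    return " ".join("%02x" % b for b in data)

decomposed = "A\u0308"   # A + COMBINING DIAERESIS
precomposed = "\u00c4"   # single code point for the same-looking letter

for s in (decomposed, precomposed):
    utf16 = s.encode("utf-16-be")
    utf8 = s.encode("utf-8")
    # Both strings render as one letter, yet the byte counts differ in either encoding.
    print(as_hex(utf16), "|", len(utf16), "bytes UTF-16,", len(utf8), "bytes UTF-8")
```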

Do you grasp this? If not, maybe you are one of those "ascii-centric ignorant morons" I keep hearing so much about.

Name one Unicode implementation which shows utf16 `0x00 0x41 0x03 0x08` as length 1.

U+0041 U+0308 is two code points by definition. Thus length == 2.

http://stackoverflow.com/questions/4579215/cross-platform-it...

Yes? `System.Globalization` or `ICU` can count graphemes, what's your point?

Those libraries are equivalent to length(normalize(utf16 `0x00 0x41 0x03 0x08`)) == 1
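To make the two notions of length concrete, here is a rough Python 3 sketch. `approx_grapheme_count` is a hypothetical helper that just skips combining marks; it is a crude stand-in and NOT the full grapheme segmentation that ICU implements.

```python
import unicodedata

def approx_grapheme_count(s):
    """Count code points that are not combining marks.

    Crude approximation for illustration only (assumption: every mark
    attaches to the preceding base character).
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

s = "A\u0308"  # the UTF-16 bytes 0x00 0x41 0x03 0x08 decoded

print(len(s))                                # code point length
print(len(unicodedata.normalize("NFC", s)))  # length after composing
print(approx_grapheme_count(s))              # approximate letter count
```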

Back to my top comment: I stated that UCS2 counts faster than UTF8 internally, because every BMP code point is just two bytes. What's wrong with that? If variable-length is so good, why is py3k using UCS-4 internally? (Which means every character is exactly 32 bits. There, I said character again.)
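For reference, the counting being argued about looks roughly like this in Python 3. The function names are made up for the sketch, and it only counts code points or code units, not letters.

```python
def count_code_points_utf8(data):
    """Scan UTF-8 bytes, counting every byte that is not a continuation byte (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

def count_units_fixed_width(data, width):
    """Fixed-width counting: byte length divided by the unit size."""
    return len(data) // width

text = "Hi, Mr\U00021613"  # six BMP characters plus one outside the BMP

print(len(text))                                              # code points
print(count_code_points_utf8(text.encode("utf-8")))           # needs a full scan
print(count_units_fixed_width(text.encode("utf-16-be"), 2))   # code units, not code points
print(count_units_fixed_width(text.encode("utf-32-be"), 4))   # one unit per code point
```

Dividing by two matches the code point count only while the text stays inside the BMP; the 32-bit form is the one where division always gives code points.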