Hacker News
by est 4949 days ago
yeah sure why not.

    >>> u'𡘓'[0:1]
    u'\U00021613'

    >>> u'Hi, Mr𡘓'[-1]
    u'\U00021613'

    >>> u'𠀋'[0:1]
    u'\U0002000b'

JavaScript won't work because JS engines use UCS-2 strings internally, duh.
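A rough sketch of the UCS-2 point, done from Python rather than JS: the same character is one code point but two 16-bit code units once encoded as UTF-16. This assumes Python 3.3 or later (where len() counts code points); the variable names are just for illustration.

```python
# Compare code points with UTF-16 code units for a non-BMP character.
s = u'\U00021613'  # the character from the first example above

code_points = len(s)                           # Python 3.3+ counts code points
utf16_units = len(s.encode('utf-16-le')) // 2  # each UTF-16 code unit is 2 bytes

print(code_points, utf16_units)  # prints: 1 2
```

An engine that indexes by 16-bit unit would see this string as length 2 and could slice it in the middle of the surrogate pair.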

Actually, JavaScript mixes up Unicode strings and binary strings; that's why Node.js invented Buffer:

http://nodejs.org/api/buffer.html


You've moved the goalposts:

  u'\U00021613'
This is a UTF-32 code unit, not a UTF-16 code unit. Even UTF-32 doesn't help when you have combining characters. I suggest you read dietrichepp's post again; he's talking about Normalization Form D.
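To make the combining-character point concrete, here is a minimal sketch using Python's standard unicodedata module (Python 3 assumed; the names composed and decomposed are only illustrative). One user-visible character becomes two code points under Normalization Form D, so counting code points, even with UTF-32, does not count characters.

```python
import unicodedata

composed = u'\u00e9'                                 # 'é' as one precomposed code point
decomposed = unicodedata.normalize('NFD', composed)  # 'e' followed by U+0301 combining acute

print(len(composed), len(decomposed))  # prints: 1 2
print(composed == decomposed)          # prints: False
print(decomposed[0:1])                 # prints: e  (the accent is sliced off)
```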
Okay, if it's an explicit combining character what's wrong with explicit character part counting?

You know normalized form is the norm, right?

There are four different normalized forms in Unicode. Maybe you should enlighten us about which one you're talking about.
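For reference, the four forms are NFC, NFD, NFKC and NFKD. A small sketch (Python 3 assumed, sample string chosen only for illustration) shows that the same text can have a different code point count under each one:

```python
import unicodedata

s = u'\ufb01\u00e9'  # the 'fi' ligature (U+FB01) followed by precomposed 'é' (U+00E9)

# Code point count of the same text under each normalization form.
lengths = {form: len(unicodedata.normalize(form, s))
           for form in ('NFC', 'NFD', 'NFKC', 'NFKD')}

print(lengths)  # prints: {'NFC': 2, 'NFD': 3, 'NFKC': 3, 'NFKD': 4}
```

So "the normalized form" has to name which form before len() or an offset means anything stable.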

Or just stop embarrassing yourself.

Reading all of your comments: so you are suggesting a Unicode object should not have len() or substring()?

A standard like that is totally not embarrassing.

I am suggesting that people read about unicode before designing supposedly cross-platform applications or programming languages. It's not that hard, just different than ASCII.
Since you understand Unicode so well, can you explain dietrichepp's theory that Unicode doesn't need counting or offsets?

http://news.ycombinator.com/item?id=4834931

And why is UCS-4 (not variable-length) chosen in many Unicode implementations? Why is wchar_t always 32-bit in POSIX?