| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by codeka 4951 days ago

I'm sorry but you're wrong. I suggest you inform yourself better of the subject you're talking about before you call people "ignorant morons" next time.

dietrichepp is talking about Normalized Form D, which is a valid form of Unicode and cannot be counted using codepoints like you're doing.

Maybe you can try:

'𠀋'.substr(0,1)

1 comments

est 4951 days ago

yeah sure why not.

    >>> u'𡘓'[0:1]
    u'\U00021613'

    >>> u'Hi, Mr𡘓'[-1]
    u'\U00021613

    >>> u'𠀋'[0:1]
    u'\U0002000b'

Javascript won't work because UCS2 in js engine, duh.

Actually Javascript is messed up with Unicode string and binary strings, that's why Nodejs invented Buffer

http://nodejs.org/api/buffer.html

link

codeka 4951 days ago

You've moved the goalposts:

  u'\U00021613'

This is a UTF-32 code unit, not a UTF-16 code unit. Even UTF-32 doesn't help when you have combining characters. I suggest you read dietrichepp's post again, he's talking about Normalization Form D.

link

est 4951 days ago

Okay, if it's an explicit combining character what's wrong with explicit character part counting?

You know normalized form is the norm, right?

link

cmccabe 4951 days ago

There are four different normalized forms in Unicode. Maybe you should enlighten us about which one you're talking about.

Or just stop embarrassing yourself.

link

est 4950 days ago

Reading all of your comments, so you are suggesting a Unicode object should not have len() or substring() ?

A standard like that is totally not embarrassing.

link

cmccabe 4950 days ago

I am suggesting that people read about unicode before designing supposedly cross-platform applications or programming languages. It's not that hard, just different than ASCII.

link