by est 4948 days ago
> See? Expected, broken behavior you get when splitting on character boundaries.

Yeah, like your Jamo trick is complex for a native CJK speaker.

Think Jamo is hard? Check out Ideographic Description Sequences. We have like millions of 偏旁部首笔画 (components, radicals, and strokes) that you can freely combine.

And the fun part is the relative stroke lengths within glyphs: 土 and 士 differ only because one line is longer than the other. How would you distinguish them?

But you know what your problem is?

It's like arguing with you that you think ส็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็ is only one character.
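
For anyone following along, here is a minimal Python sketch of why that string is contentious. It assumes a shortened sample (one ส plus five copies of U+0E47, rather than the full run above) and uses the general category from `unicodedata` as a rough stand-in for "user-perceived character"; real grapheme cluster segmentation is defined by Unicode UAX #29 and is more involved than this.

```python
import unicodedata

# Hypothetical shortened sample: THAI CHARACTER SO SUA (U+0E2A) followed by
# five copies of THAI CHARACTER MAITAIKHU (U+0E47), a nonspacing mark.
s = "\u0E2A" + "\u0E47" * 5

# Python's len() on a str counts code points.
code_points = len(s)

# Count only code points that are not marks (general category M*).
# This approximates the number of base characters; it is not a full
# grapheme cluster segmentation.
base_count = sum(
    1 for ch in s if not unicodedata.category(ch).startswith("M")
)

print(code_points, base_count)  # 6 1
```

So the same string is six code points but one base character, and which answer counts as "the length" depends on which of the two you meant to count.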

IMPOSSIBU?!!!???

And just because U+202E exists on the lolternet, you take away our ability to count the 99% of CJK characters that are normal???!??!111!

Combining sequences are normalized to a single character in most cases, and should be countable and indexable separately.

If you type combining characters EXPLICITLY, each one gets counted separately, naturally. What's wrong with that?
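
A rough Python sketch of what normalization does and does not collapse, using `unicodedata.normalize` with NFC. The three sample strings are picked for illustration only: a Latin letter plus a combining accent, a conjoining Jamo sequence, and a Thai base letter with stacked marks that has no precomposed form.

```python
import unicodedata

# Illustrative inputs (assumptions, not taken from the thread):
decomposed_e = "e\u0301"          # 'e' + COMBINING ACUTE ACCENT
jamo = "\u1112\u1161\u11AB"       # conjoining Jamo that compose to U+D55C
thai = "\u0E2A" + "\u0E47" * 3    # base letter + marks, no precomposed form

nfc_e = unicodedata.normalize("NFC", decomposed_e)
nfc_jamo = unicodedata.normalize("NFC", jamo)
nfc_thai = unicodedata.normalize("NFC", thai)

# Sequences with a precomposed equivalent collapse to one code point.
print(len(decomposed_e), len(nfc_e))   # 2 1
print(len(jamo), len(nfc_jamo))        # 3 1

# Sequences without one keep every explicitly typed mark.
print(len(thai), len(nfc_thai))        # 4 4
```

After NFC, the first two count as one code point each, while the explicitly stacked marks in the third are still counted one by one.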

Or else why don't we abandon Unicode and let every country deal with its own weird glyph composition shit?

1 comments

By being unnecessarily insulting you're degrading your own position, and your argument is collapsing under the weight of your anger and sarcasm. I see Dietrich and Colin working with definitions of code point, character and glyph that illuminate why counting one way will lead to problems when you slip into thinking you're counting the other way. Then in your much-too-fired-up responses you conflate them again, and muddy us back to square one.

It seems to me you're deriding us for being native speakers of languages with alphabets, and also deriding us for wanting APIs that prevent developers from alphabet-language backgrounds from making the mistakes our assumptions would incline us towards. You're going to have to decide if you're angry because you like the "simplicity" of UTF-16, because we don't speak a CJK language as well as you do (maybe Dietrich or Colin does; I have no idea) or because you're just angry and this is where you've come to blow off steam. If it's the third, I hope you'll try Reddit first next time, since this kind of behavior seems to be a lot more acceptable there than here.

For fuck's sake, I am not defending UTF-16's simplicity, I am defending that:

a fixed-width encoding can count code points (I worded it as "characters") faster than a variable-length one
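
To make the claim concrete, a minimal Python sketch of counting code points in each encoding. The function names are made up for illustration, and all three assume well-formed input with no BOM. The fixed-width count is a single division, while the variable-length counts have to scan every code unit.

```python
def count_utf32(data: bytes) -> int:
    # Fixed width: every code point is exactly 4 bytes, so no scan is needed.
    return len(data) // 4


def count_utf8(data: bytes) -> int:
    # Variable width: count every byte that is not a continuation byte
    # (continuation bytes have the bit pattern 10xxxxxx).
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def count_utf16le(data: bytes) -> int:
    # Variable width: count every 16-bit unit that is not a low surrogate,
    # so each surrogate pair is counted once.
    count = 0
    for i in range(0, len(data), 2):
        unit = data[i] | (data[i + 1] << 8)
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


# Illustrative text: two BMP ideographs, one astral code point, one Hangul syllable.
text = "\u571F\u58EB\U0001F600\uD55C"

print(count_utf32(text.encode("utf-32-le")))    # 4
print(count_utf8(text.encode("utf-8")))         # 4
print(count_utf16le(text.encode("utf-16-le")))  # 4
```

All three agree on the code point count; the difference is only in how much work it takes to get there.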

Then this dietrichepp tries to educate me that two combined code points should be treated equally with another single code point. WTF, y u no normalization?

Downvote me as you like, but you can't change the fact that UCS-4 is used internally in Unicode systems.

Any reason other than for faster code point counting?

-----------------

dietrichepp also offended me by saying that Unicode characters should not be counted or indexed by offset. QFT:

> Why do you want to count Unicode characters? Why do you care if it is fast to do so? Why would you ever need to use character-based string indexing?

What in the world is your goal in this conversation? In your impotent rage you've only established that it's useless to count code points, completely counter to your original point in favor of UCS-2.