by e12e
4618 days ago
Interesting objection. To me it is rather obvious that a text-oriented language would treat any "string" as a) an atomic string and b) a sequence of the "next-lower" logical unit. I do just now realize that "An English sentence.".length could by this reasoning return 3 or 4 (3 words, one punctuation mark...).
What is "ö".length?
It's one grapheme, an o with a diaeresis. It's two codepoints, an o (U+006F) followed by a combining diaeresis (U+0308). It's several bytes, depending on encoding: three in UTF-8, four in UTF-16.
How about if you reverse it first, so that the diaeresis doesn't have anything to combine with, and you have a bare letter 'o'? What's the length now? If you answered _one_ to the above, you've got a string whose length doubles when you reverse it. Is that what you want?
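To make the three answers concrete, here is a rough Python sketch. The helper approx_clusters is hypothetical and only approximates grapheme counting by treating any codepoint in a Unicode "Mark" category as attaching to whatever precedes it; it is not the full segmentation algorithm from the Unicode standard, just enough to show the reversal problem.

```python
import unicodedata

s = "o\u0308"  # 'o' (U+006F) followed by COMBINING DIAERESIS (U+0308)

codepoints = len(s)                      # Python's len() counts codepoints
utf8_bytes = len(s.encode("utf-8"))      # byte length in UTF-8
utf16_bytes = len(s.encode("utf-16-le")) # byte length in UTF-16 (no BOM)


def approx_clusters(text):
    """Hypothetical helper: rough grapheme count.

    A codepoint whose general category starts with 'M' (a mark) is
    folded into the preceding cluster. A mark at the very start of the
    text has nothing to attach to, so it counts on its own.
    """
    count = 0
    started = False
    for ch in text:
        if started and unicodedata.category(ch).startswith("M"):
            continue  # mark attaches to the cluster before it
        count += 1
        started = True
    return count


reversed_s = s[::-1]  # naive codepoint reversal: diaeresis first, then 'o'

print(codepoints, utf8_bytes, utf16_bytes)
print(approx_clusters(s), approx_clusters(reversed_s))
```

Under this approximation the original string counts as one cluster and the naively reversed one as two, which is exactly the "length doubles when you reverse it" problem.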
Too easy for you?
Let's take the Thai consonant "ก", which is sort of a g, sort of a k sound. One grapheme, one codepoint. Sorted. We'll add a vowel to it: "กอ". Two codepoints, but how many graphemes? One or two? Let's say two, but then let's point out that there is no logical difference between that and a different vowel: "กี". This is a little more complicated. What's the length now? Is that one or two graphemes? It's clear as day that that's a single consonant + a single vowel, but how long is the string? How about: "เกียะ"? That's still a single consonant + a single combining vowel, only this time it's a compound vowel. One consonant, one vowel, how many graphemes? Are you using vertical slicing to determine what is and isn't a grapheme? Is that right?
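For the Thai examples, a small Python sketch shows how a mechanical rule splits them. The helper names describe and count_non_marks are made up for illustration, and "count everything that isn't a combining mark" is only an assumed stand-in for "vertical slicing", not a standard definition of a grapheme. The strings are written as escapes so the exact codepoints are unambiguous.

```python
import unicodedata

# The four strings from the comment, spelled out as codepoints.
ko_kai = "\u0e01"                               # ก
with_o_ang = "\u0e01\u0e2d"                     # กอ
with_sara_ii = "\u0e01\u0e35"                   # กี
compound = "\u0e40\u0e01\u0e35\u0e22\u0e30"     # เกียะ


def describe(text):
    """Hypothetical helper: list (codepoint, general category) pairs."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


def count_non_marks(text):
    """Hypothetical helper: count codepoints that are not marks (M*)."""
    return sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )


for sample in (ko_kai, with_o_ang, with_sara_ii, compound):
    print(len(sample), count_non_marks(sample), describe(sample))
```

With this rule, consonant + vowel comes out as two units for "กอ" but one for "กี", and the compound vowel in "เกียะ" comes out as four, even though each is one consonant plus one vowel to a reader of Thai.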
To see this taken to its logical end by The Masters of Unicode: http://www.unicode.org/faq/char_combmark.html - "How are characters counted when measuring the length or position of a character in a string?"