
by e12e 4617 days ago

TL;DR -- I generally fall into the camp that counts graphemes, as per the second FAQ you linked, when talking about user-facing text processing. I don't think it makes sense to have one API that tries to appease both (low-level) programmers and end users.

Perhaps I wasn't entirely clear -- I certainly see that there are complications. But I think you're overcomplicating your examples within the domain of text: I'd say a composed character counts as one, and reversing a string that contains a composed character shouldn't reverse or destroy the composition. The reverse of "õ" isn't "o~" but simply "õ", and the lengths of "o" and "õ" should both be one, even if they aren't encoded the same way.
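To make that concrete, here is a minimal Python sketch of what I mean by "user-facing" length and reversal. The names `clusters`, `user_length` and `user_reverse` are made up for this example, and the grouping rule (attach every Unicode "Mark" code point to the character before it) is a simplification, not full grapheme segmentation.

```python
import unicodedata

def clusters(text):
    """Split text into base characters plus their trailing combining marks.

    Simplifying assumption: any code point in a Mark category (Mn, Mc, Me)
    belongs to the preceding character. Real grapheme segmentation has
    more rules than this.
    """
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch      # attach the mark to the previous cluster
        else:
            out.append(ch)     # start a new cluster
    return out

def user_length(text):
    """Length as an end user would count it."""
    return len(clusters(text))

def user_reverse(text):
    """Reverse cluster by cluster so compositions stay intact."""
    return "".join(reversed(clusters(text)))

decomposed = "o\u0303"    # "o" followed by COMBINING TILDE
precomposed = "\u00f5"    # the single code point for "õ"

naive_reverse = decomposed[::-1]        # tilde now comes before the "o"
kept_together = user_reverse(decomposed)
```

With this, both spellings of "õ" have a user-facing length of one, while plain `len` reports 2 for the decomposed form, and a code-point-wise reverse moves the tilde in front of the "o".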

Now, this won't work for lower-level work on "computer language" strings, so for your Unicode library or whatever you'd have to count differently. Obviously you have to do some magic when converting a multibyte-encoded string from big-endian to little-endian and vice versa -- but that's hardly the same operation as reversing a string.
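A rough illustration of why the byte-order conversion is a different operation from reversal, assuming UTF-16 as the encoding in question (my choice for the example). `swap_byte_order` is a hypothetical helper, not a library function.

```python
text = "o\u0303a"                 # "õ" (decomposed) followed by "a"

be = text.encode("utf-16-be")     # big-endian 16-bit code units
le = text.encode("utf-16-le")     # little-endian 16-bit code units

def swap_byte_order(data):
    """Swap the two bytes inside each 16-bit unit; the unit order is kept.

    Assumes len(data) is even, as it is for any valid UTF-16 byte string.
    """
    swapped = bytearray(data)
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)

converted = swap_byte_order(be)   # same text, other byte order
reversed_bytes = be[::-1]         # reverses everything, a different result
```

Swapping within each unit turns the big-endian bytes into the little-endian ones and the text survives unchanged. Reversing the whole byte string gives something else entirely.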

I'm not familiar with Thai, but to me it looks like your "กอ" and "กี" are equivalent to the Norwegian vowel "æ", which used to be written/typeset as "ae" (and can still be treated as a composition in some input locales). So the length of "ae" is 2 and the length of "æ" is 1, as is the length of "a". That would mess up "ae" if reversed, but I would consider that a special/archaic use case. I'm not sure whether that would be similar in Thai -- I don't know, for example, whether typewriters and computers have been widely used for a comparable time in Norway and Thailand (I'm guessing Thailand has a few thousand more years of printing/literacy).
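For what it's worth, this is how one could inspect those strings in Python. I'm assuming the two Thai examples are encoded as U+0E01 U+0E2D and U+0E01 U+0E35. The sketch only reports Unicode general categories and says nothing about how Thai readers count.

```python
import unicodedata

samples = {
    "ae": "ae",                  # two separate letters
    "ash": "\u00e6",             # "æ" as a single code point
    "thai_1": "\u0e01\u0e2d",    # assumed encoding of the first Thai example
    "thai_2": "\u0e01\u0e35",    # assumed encoding of the second Thai example
}

# General category of every code point: "L*" is a letter, "M*" is a mark.
categories = {
    name: [unicodedata.category(ch) for ch in s]
    for name, s in samples.items()
}

# "æ" has no decomposition, so normalization never turns it into "a" + "e".
ash_decomposed = unicodedata.normalize("NFKD", samples["ash"])
```

Going by the categories, the second Thai example is a letter followed by a mark, which makes it look more like the "õ" case than the "æ" case, while the first is two letters.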

As mentioned in my comment above, I also find it interesting that if we take length to mean "number of things in a sequence", then the length of a sentence would be the number of words, the length of a word would be the number of graphemes, and the length of a grapheme might be either the number of bits/bytes or, if there is a level in between, the number of composites.

So we might have:

   "This is an example.".length => 4 (or 5 or 8 depending
      on how we define spaces and punctuation)
   "This".length => 4
   "T".length => 1 byte, 7 or 8 bits, or maybe even 2 in a
     prefix-based encoding (capital-transform t).

The logic would be that the full sentence is treated as a sequence of words, each word as a sequence of graphemes, each grapheme as a sequence of codepoints, and each codepoint as a stream of bits...
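A small Python sketch of that layering, under some loud assumptions: words are split on whitespace (so punctuation sticks to its word), graphemes use the simplified "base plus following marks" rule, and the byte level is UTF-8. All function names are invented for the example.

```python
import unicodedata

def words(sentence):
    """Sentence -> words. Assumption: split on whitespace only."""
    return sentence.split()

def graphemes(word):
    """Word -> graphemes. Simplified: marks attach to the previous character."""
    out = []
    for ch in word:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def code_points(grapheme):
    """Grapheme -> code points."""
    return list(grapheme)

def utf8_bytes(code_point):
    """Code point -> bytes. Assumption: UTF-8 as the encoding."""
    return code_point.encode("utf-8")

sentence_len = len(words("This is an example."))    # counted in words
word_len = len(graphemes("This"))                   # counted in graphemes
tilde_o = graphemes("o\u0303")[0]                   # one grapheme
grapheme_len = len(code_points(tilde_o))            # counted in code points
byte_lens = [len(utf8_bytes(cp)) for cp in code_points(tilde_o)]
```

Each level only knows how to count the things directly below it, which is the point: "length" gives 4 words, 4 graphemes, 2 code points, and 1 and 2 bytes respectively, depending on which layer you ask.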