by BorgHunter 4787 days ago
Even this definition can get hairy, though. What is a character? Is 'á' one character or two? Most human beings would say one, but in actuality I formed it with an 'a' (U+0061) and a combining acute accent (U+0301): Two separate code points. But you can also get the same result with 'á' (U+00E1); this is not true of all combining character combinations.
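For what it's worth, here is a minimal Python sketch of that equivalence, using the standard library's `unicodedata.normalize`. The variable names are mine, and the `d` + U+034A pair is just one example of a combining sequence that has no precomposed form.

```python
import unicodedata

decomposed = "a\u0301"   # 'a' followed by a combining acute accent
precomposed = "\u00e1"   # the single precomposed code point

# Both render the same, but they are different code point sequences.
print(decomposed == precomposed)           # False
print(len(decomposed), len(precomposed))   # 2 1

# NFC composes where a precomposed form exists; NFD decomposes again.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True

# 'd' + U+034A has no precomposed code point, so NFC leaves both in place.
print(len(unicodedata.normalize("NFC", "d\u034a")))   # 2
```

Normalizing both sides to the same form before comparing sidesteps the "one or two" question for equality checks, but it does not make every visible character a single code point.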

In the past, I've had to deal with horrible mashups of fixed-byte-length columns in flat text files with UTF-8 bolted onto them. In Java, no less. Trying to figure out how to deal with all the edge cases (how do you truncate a string when the boundary is between a "normal" character and a combining character?) was an endless parade of the bizarre. Strings are hard, fundamentally.
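A rough sketch of one way to handle that truncation case, in Python rather than Java for brevity. The function name `truncate_utf8` is hypothetical, and treating "a base code point plus any following code points with a nonzero combining class" as one unit is a simplifying assumption: it is not full grapheme cluster segmentation and will miss things like emoji joiner sequences.

```python
import unicodedata

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Longest prefix of text that fits in max_bytes of UTF-8 without
    splitting a code point or separating a base from its combining marks."""
    used = 0   # bytes consumed so far
    end = 0    # code point index of the last safe cut
    i = 0
    n = len(text)
    while i < n:
        # Group one code point with the combining marks that follow it.
        j = i + 1
        while j < n and unicodedata.combining(text[j]) != 0:
            j += 1
        size = len(text[i:j].encode("utf-8"))
        if used + size > max_bytes:
            break  # the whole group does not fit, so drop all of it
        used += size
        end = j
        i = j
    return text[:end]

# 'x' is 1 byte; 'd' + U+034A is 1 + 2 bytes and is kept or dropped together.
print(ascii(truncate_utf8("xd\u034a", 3)))   # 'x'
print(ascii(truncate_utf8("xd\u034a", 4)))   # 'xd\u034a'
```

The design choice here is to drop the base character along with its accent when the pair does not fit, rather than keep a bare base; whether that is the right call depends on what the consumer of the fixed-width column expects.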

2 comments

In that case you “only” have to know what you're actually after: either a grapheme (a character in the human sense) or a code point (a character in the Unicode sense). Well, and then there is the code unit, which can be a “character” in the programming language sense, but it's best not to go there unless you want to fall into plenty of traps.

As long as you only wander around one of those levels (grapheme, code point, code unit, byte), all is (fairly) easy, but once you deal with multiple levels, mistakes almost invariably creep in and you start treating code points as graphemes or code units as code points, etc. Fun source of all kinds of bugs :-)
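To make the four levels concrete, here is a small Python sketch that counts each one for the same string. The helper `count_levels` is hypothetical, and the grapheme count is only an approximation (it counts code points whose combining class is zero), not a real grapheme cluster segmenter.

```python
import unicodedata

def count_levels(s: str) -> dict:
    """Count one string at each level: grapheme (approx.), code point,
    UTF-16 code unit, and UTF-8 byte."""
    return {
        # Approximation: every non-combining code point starts a grapheme.
        "graphemes_approx": sum(1 for ch in s if unicodedata.combining(ch) == 0),
        "code_points": len(s),
        # UTF-16 code units are what a Java char holds; 2 bytes each.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "utf8_bytes": len(s.encode("utf-8")),
    }

# 'a' + combining acute accent: 1 grapheme, 2 code points, 2 units, 3 bytes.
print(count_levels("a\u0301"))
# U+1F600 (an emoji outside the BMP): 1 grapheme, 1 code point, 2 units, 4 bytes.
print(count_levels("\U0001F600"))
```

The numbers disagree at every level for even these short strings, which is why an index or length computed at one level cannot safely be used at another.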

So yes, text in general is hard. And Han, Hangul and the Japanese scripts are probably among the easiest scripts to support in software :-)

Yes. It's much better to think of strings as a sequence of Unicode code points.
As your parent already noted, thinking of it as a sequence of code points goes wrong when you need to truncate a string in between a base and a combining character.
Not true. Take this string:

  d͊

It is composed of two code points: U+0064 and U+034A. The second code point is a combining character. The two code points together form one glyph. The term "character" is confusing because people use different definitions for it, so I avoid using it; the term "Unicode code point", by contrast, is very clear.

Python 3's strings are sequences of code points. The above string is represented like this:

  >>> print("d\u034A")
  d͊
  >>> len("d\u034A")
  2
Truncating between the base and combining code points works as expected:

  >>> "d\u034a"[0]
  'd'
  >>> "d\u034a"[1]
  '͊'
Except it doesn't work as expected because users generally expect graphemes to stay as they are instead of losing random diacritics.
By users, do you mean Python 3 programmers?