by ygra 4787 days ago
As your parent already noted, thinking of it as a sequence of code points goes wrong when you need to truncate a string in between a base and a combining character.

Not true. Take this string:

  d͊

It is composed of two code points: U+0064 and U+034A. The second code point is a combining character, and the two code points together form one glyph. The term "character" is confusing because people use different definitions for it, so I avoid using it; the term "Unicode code point", by contrast, is very clear.

A Python 3 string is a sequence of code points. The above string is represented like this:

  >>> print("d\u034A")
  d͊
  >>> len("d\u034A")
  2
Truncating between the base and combining code points works as expected:

  >>> "d\u034a"[0]
  'd'
  >>> "d\u034a"[1]
  '͊'
Except it doesn't work as expected because users generally expect graphemes to stay as they are instead of losing random diacritics.
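
For illustration, here is a rough sketch of a truncation that backs up instead of cutting between a base and its combining marks. The function name truncate_keep_marks is made up for this example, and it only looks at unicodedata.combining() (the canonical combining class), so it is not full grapheme cluster segmentation: things like emoji ZWJ sequences or marks with combining class 0 are not handled.

```python
import unicodedata


def truncate_keep_marks(text: str, max_code_points: int) -> str:
    """Hypothetical helper: cut text to at most max_code_points code points
    without separating a combining mark from the code point before it."""
    if max_code_points >= len(text):
        return text
    end = max(max_code_points, 0)
    # If the first code point we would drop is a combining mark, the cut
    # falls inside a base + mark sequence, so move the cut back until it
    # sits in front of a non-combining code point.
    while end > 0 and unicodedata.combining(text[end]) != 0:
        end -= 1
    return text[:end]


s = "ad\u034ab"  # 'a', then 'd' + U+034A, then 'b': four code points
print(repr(s[:2]))                     # plain slicing keeps 'd' but drops its mark
print(repr(truncate_keep_marks(s, 2))) # backs up to before 'd'
```

The trade-off is that the result can be shorter than the requested length, since a base with its marks is either kept whole or dropped whole.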
By users, do you mean Python 3 programmers?