| HN Mirror


by Flimm 4787 days ago

Not true. Take this string:

d͊

It is composed of two code points: U+0064 and U+034A. The second code point is a combining character. The two code points together form one glyph. The term "character" is confusing because people use different definitions for it, so I avoid using it; the term "Unicode code point", by contrast, is unambiguous.

A Python 3 string is a sequence of code points. The above string is represented like this:

  >>> print("d\u034A")
  d͊
  >>> len("d\u034A")
  2

Truncating between the base and combining code points works as expected:

  >>> "d\u034a"[0]
  'd'
  >>> "d\u034a"[1]
  '͊'
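
To see which of the two code points is the combining one, the standard library's unicodedata module can help. This is only a small sketch; the helper name describe is made up for illustration and is not part of any library:

```python
import unicodedata

s = "d\u034a"

def describe(text):
    """Return (code point, Unicode name, combining class) for each code point.

    'describe' is a hypothetical helper name used only for this example.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.combining(ch))
        for ch in text
    ]

for entry in describe(s):
    print(entry)
```

A combining class of 0 means the code point is not a combining mark with a canonical combining class; a nonzero value means it attaches to a preceding base.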

1 comment

ygra 4786 days ago

Except it doesn't work as expected because users generally expect graphemes to stay as they are instead of losing random diacritics.
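
A rough sketch of truncation that avoids stripping diacritics, using only the standard library. The function name truncate_keeping_marks is hypothetical, and this only approximates grapheme clusters by looking at the canonical combining class; it does not implement full Unicode text segmentation, so things like emoji joined with ZWJ are not handled:

```python
import unicodedata

def truncate_keeping_marks(text, max_code_points):
    """Cut text to at most max_code_points code points without separating
    a base code point from the combining marks that follow it.

    Hypothetical helper: an approximation, not full grapheme segmentation.
    """
    if len(text) <= max_code_points:
        return text
    end = max_code_points
    # If the first dropped code point is a combining mark, cutting here
    # would strip it from its base, so back up to just before that base.
    while end > 0 and unicodedata.combining(text[end]) != 0:
        end -= 1
    return text[:end]

print(repr(truncate_keeping_marks("ad\u034a", 2)))
```

The design choice here is to drop the whole base-plus-marks unit when it does not fit, rather than keep the base and lose the mark.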


Flimm 4786 days ago

By users, do you mean Python 3 programmers?
