Hacker News
by morpher 4787 days ago
This is entirely the wrong take-away message from this article. The point is that strings are not sequences of numbers but, rather, sequences of characters. Characters are abstracted from the underlying byte representation, which is unimportant when dealing with strings.

For situations where a concrete byte representation is needed, you can get one by encoding the string.

3 comments

Even this definition can get hairy, though. What is a character? Is 'á' one character or two? Most human beings would say one, but in actuality I formed it with an 'a' (U+0061) and a combining acute accent (U+0301): Two separate code points. But you can also get the same result with 'á' (U+00E1); this is not true of all combining character combinations.
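To make the two spellings above concrete, here is a small sketch using Python's standard `unicodedata` module. The variable names are only illustrative; the code shows that the two forms compare unequal until they are normalized, and that a pair such as U+0064 plus U+034A has no single-code-point form to compose into.

```python
import unicodedata

decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE

# Same rendering, different code point sequences.
print(decomposed == precomposed)       # False
print(len(decomposed), len(precomposed))  # 2 1

# NFC composes where a precomposed form exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)  # True
print(nfd == decomposed)   # True

# 'd' + U+034A has no precomposed equivalent, so NFC leaves two code points.
no_precomposed = unicodedata.normalize("NFC", "d\u034a")
print(len(no_precomposed))  # 2
```

Comparing strings only after normalizing both sides to the same form avoids treating the two spellings as different text.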

In the past, I've had to deal with horrible mashups of fixed-byte-length columns in flat text files with UTF-8 bolted on. In Java, no less. Trying to figure out how to deal with all the edge cases (how do you truncate a string when the boundary is between a "normal" character and a combining character?) was an endless parade of the bizarre. Strings are hard, fundamentally.

In that case you “only” have to know what you're actually after: either a grapheme (a character in the human sense) or a code point (a character in the Unicode sense) – well, and then there is the code unit, which can be a “character” in the programming language sense, but it's best not to go there, lest you fall into plenty of traps.

As long as you only wander around one of those levels (grapheme, code point, code unit, byte) all is (fairly) easy, but once you deal with multiple levels mistakes almost invariably creep in and you start treating code points as graphemes or code units as code points, etc. Fun source of all kinds of bugs :-)
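The four levels named above give four different lengths for the same text. The following is a rough sketch in Python; `count_levels` is a hypothetical helper, and its grapheme count is only an approximation (it skips combining marks, whereas real grapheme segmentation follows the Unicode text segmentation rules and also handles things like emoji joiner sequences).

```python
import unicodedata

def count_levels(text):
    """Return the length of text at each level (grapheme count is approximate)."""
    return {
        # Approximation: every code point that is not a combining mark starts a grapheme.
        "graphemes (approx.)": sum(1 for ch in text if unicodedata.combining(ch) == 0),
        # Python 3 strings are indexed by code point.
        "code points": len(text),
        # UTF-16 code units are 2 bytes each; "-le" avoids a byte order mark.
        "utf-16 code units": len(text.encode("utf-16-le")) // 2,
        "utf-8 bytes": len(text.encode("utf-8")),
    }

# 'a' + combining acute accent, followed by an emoji outside the BMP.
levels = count_levels("a\u0301\U0001F600")
for name, value in levels.items():
    print(name, value)
# graphemes (approx.) 2
# code points 3
# utf-16 code units 4
# utf-8 bytes 7
```

Mixing up any two of these numbers (for example, using the byte count as a column width) is exactly the kind of cross-level bug described above.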

So yes, text in general is hard. And Han, Hangul and the Japanese scripts are probably among the easiest scripts to support in software :-)

Yes. It's much better to think of strings as a sequence of Unicode code points.
As your parent already noted, thinking of it as a sequence of code points goes wrong when you need to truncate a string in between a base and a combining character.
Not true. Take this string:

  d͊

It is composed of two code points: U+0064 and U+034A. The second code point is a combining character. The two code points together form one glyph. The term "character" is confusing because people use different definitions for it, so I avoid using it; the term "Unicode code point", by contrast, is very clear.

A Python 3 string is a sequence of code points. The above string is represented like this:

  >>> print("d\u034A")
  d͊
  >>> len("d\u034A")
  2
Truncating between the base and combining code points works as expected:

  >>> "d\u034a"[0]
  'd'
  >>> "d\u034a"[1]
  '͊'
Except it doesn't work as expected because users generally expect graphemes to stay as they are instead of losing random diacritics.
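One way to truncate without stripping diacritics is to back the cut point up until it no longer lands in front of a combining mark. This is a minimal sketch, not a full grapheme-cluster implementation: `truncate_keep_marks` is a hypothetical helper, and it only looks at combining classes from `unicodedata`, so it does not cover every kind of cluster.

```python
import unicodedata

def truncate_keep_marks(text, max_code_points):
    """Cut text to at most max_code_points without separating a base
    code point from the combining marks that follow it."""
    if len(text) <= max_code_points:
        return text
    end = max_code_points
    # If the first dropped code point is a combining mark, the cut would
    # split a grapheme, so move the cut back to before its base.
    while end > 0 and unicodedata.combining(text[end]) != 0:
        end -= 1
    return text[:end]

print(truncate_keep_marks("ad\u034a", 2))         # 'a'  (drops the whole d + mark)
print(truncate_keep_marks("d\u034ab", 2) == "d\u034a")  # True (the cut falls on a boundary)
print(repr(truncate_keep_marks("d\u034a", 1)))    # ''   (the grapheme does not fit)
```

The result can be shorter than the requested limit, which is the price of keeping each base together with its marks.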
By users, do you mean Python 3 programmers?
Indeed. I think Python 3 is very explicit with that distinction as well. You can have either text, which is in Unicode, or data, which is arbitrary bytes. Sure, those bytes can represent text by interpreting them with a specific encoding, but you have to convert between one and the other explicitly to make it work. A very nice thing after the debacle in Python 2, where bytestrings in UTF-8 locales on Unix-likes happened to almost work in many cases, just to break horribly in other environments.

That being said, there are a lot of inaccuracies and even wrong things in that article, which saddens me.

If you think of a string as a sequence of code points (integers) you get a correct but inefficient model (UTF-32).

Character is a context in which we read integers and currently we don't use more than a couple hundred thousand of those.

UTF-8 is a data-compression technique taking advantage of the fact that smaller code points are used more often, and the largest ones (which require 5+ bytes, because any compression algorithm expands some inputs) are, currently, not standardized and effectively never used.

In Python 3, there is a clear distinction between string objects and bytes objects. String objects are a sequence of Unicode code points, and bytes objects are a sequence of 0-255 byte values.
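A short illustration of that distinction, using the same example string as earlier in the thread (the variable names are only illustrative):

```python
text = "d\u034a"             # str: a sequence of code points
data = text.encode("utf-8")  # bytes: a sequence of values in 0-255

print(len(text), len(data))  # 2 3
print(list(data))            # [100, 205, 138]

# Decoding with the same encoding round-trips back to the string.
print(data.decode("utf-8") == text)  # True

# Mixing the two types is an error rather than a silent conversion.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False
```

The conversion is always explicit: `str.encode` goes from text to bytes and `bytes.decode` goes back, each with a named encoding.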

> Character is a context in which we read integers and currently we don't use more than a couple hundred thousand of those.

I don't understand this sentence.

> UTF-8 is a data-compression technique taking advantage of the fact that smaller code points are used more often, and the largest ones (which require 5+ bytes, because any compression algorithm expands some inputs) are, currently, not standardized and effectively never used.

It's better to think of UTF-8 and UTF-32 as encodings: their role is to serialize strings into byte sequences and to deserialize byte sequences into strings. Some encodings are more efficient than others in different cases, but that doesn't change their role, and it's not necessary to understand this to avoid Unicode mistakes.

There never will be more than 4 bytes for UTF-8 because Unicode is restricted to 21 bits. Remember that all UTFs have to be able to represent all of Unicode and UTF-16 could not represent those “code points” where UTF-8 needs 5+ bytes.
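The 4-byte ceiling is easy to check from Python, whose `chr` accepts exactly the Unicode range up to U+10FFFF. The boundary values below are chosen for illustration; surrogates are left out because they cannot be encoded as UTF-8.

```python
MAX_CODE_POINT = 0x10FFFF
print(MAX_CODE_POINT.bit_length())  # 21

# Encoded length at each boundary of the UTF-8 length classes.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, MAX_CODE_POINT]
utf8_lengths = [len(chr(cp).encode("utf-8")) for cp in boundaries]
print(utf8_lengths)  # [1, 2, 2, 3, 3, 4, 4]

# Anything past U+10FFFF is not a code point at all.
try:
    chr(MAX_CODE_POINT + 1)
    beyond_ok = True
except ValueError:
    beyond_ok = False
print(beyond_ok)  # False
```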

Also I wouldn't say that UTF-8 is a compression scheme. SCSU is one but has its own share of problems. UTF-8 just happens to preserve ASCII compatibility, which is an important property for Unix-like systems. Nothing more and nothing less. That it also happens to be more space-efficient for text that consists mostly of ASCII characters is merely a side-effect of that.

From the standpoint of an English speaker, UTF-8 is effectively a (good) compression scheme for Unicode, as opposed to using 2 or more bytes for every character.

I guess if I were German or Spanish (to say nothing of Asian languages), it would be the opposite of compression :-)
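How the sizes actually compare for a given text is easy to measure. The sketch below uses a hypothetical helper, `encoded_sizes`, and three sample strings picked purely for illustration; it is not a benchmark of real-world text.

```python
def encoded_sizes(text):
    """Return (utf-8, utf-16, utf-32) sizes in bytes, without byte order marks."""
    return (
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
        len(text.encode("utf-32-le")),
    )

# Illustrative samples only: ASCII, Latin script with non-ASCII letters, CJK.
samples = ["hello", "Grüße", "日本語"]
for sample in samples:
    print(sample, encoded_sizes(sample))
# hello (5, 10, 20)
# Grüße (7, 10, 20)
# 日本語 (9, 6, 12)
```

In these samples, UTF-8 costs one extra byte per non-ASCII Latin letter compared with ASCII, and only the CJK string comes out larger in UTF-8 than in UTF-16.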