by cpeterso 5406 days ago
So Twitter allows 140 (UTF-8?) characters, regardless of the number of bytes? The article wasn't clear about this.
2 comments

Here's the Twitter dev article that I used for designing my URL shortener: https://dev.twitter.com/docs/counting-characters#Twitter_Cha...

Basically, they normalize to Unicode Normalization Form C (NFC) and then count code points, not UTF-8 bytes.

A slight modification to your tldr (it's a little confusing to me as I'm not so familiar with unicode):

Basically, they count code points, not UTF-8 bytes.

Before they count the code points, they normalize using the Normalization Form C of Unicode (NFC), which aims to combine diacritics, so the "é" in "café" counts as 1 code point. If they used NFD to normalize, "é" would be normalized into 2 code points - "e" and a diacritic mark. If they didn't normalize, then it would be client dependent.
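
For anyone who wants to see this for themselves, here is a minimal Python sketch using the standard `unicodedata` module. It only illustrates NFC vs. NFD on "café"; the variable names are mine, and it says nothing about how Twitter actually implements the count.

```python
import unicodedata

# "café" as two different clients might send it:
composed = "caf\u00e9"     # precomposed é (U+00E9)
decomposed = "cafe\u0301"  # plain "e" followed by a combining acute accent (U+0301)

# In Python 3, len() on a str counts code points.
print(len(composed), len(decomposed))  # 4 5 (client dependent without normalization)

nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
nfd_a = unicodedata.normalize("NFD", composed)

print(len(nfc_a), len(nfc_b))  # 4 4 (NFC composes, so both inputs agree)
print(len(nfd_a))              # 5   (NFD splits é into "e" + accent)
```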

Normalization is distinct from encoding. A Unicode string can be normalized in several different ways (including NFC and NFD), which changes the actual Unicode code points. Each normalized form can be encoded using several different encodings (e.g. UTF-8, UTF-32, Latin-1, etc.). Normalization affects the number of code points, and therefore the number of bytes. Encoding (Unicode -> bytes) changes the number of bytes, but should not affect the number of code points, since code points are independent of the encoding.
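
A rough Python sketch of that distinction, in case it helps. The helper name `sizes` is made up for this example; the little-endian codec names are used only so the byte counts don't include a byte order mark.

```python
import unicodedata

def sizes(text: str) -> dict:
    """Return the code point count and the encoded byte counts for text."""
    return {
        "code_points": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le: no 2-byte BOM prepended
        "utf-32": len(text.encode("utf-32-le")),  # -le: no 4-byte BOM prepended
    }

nfc = unicodedata.normalize("NFC", "caf\u00e9")
nfd = unicodedata.normalize("NFD", "caf\u00e9")

print(sizes(nfc))  # {'code_points': 4, 'utf-8': 5, 'utf-16': 8, 'utf-32': 16}
print(sizes(nfd))  # {'code_points': 5, 'utf-8': 6, 'utf-16': 10, 'utf-32': 20}
```

The normalization form changes the code point count; for a given form, the encoding only changes the byte count.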

Recapping - Twitter does not count bytes. They count code points, after the code points have been normalized (combining diacritics with characters), so message length should not depend on the client.
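
Put together, the rule as described in this thread could be sketched like this in Python. To be clear, `tweet_length`, `fits`, and `LIMIT` are hypothetical names, and this is my reading of the description above, not Twitter's code.

```python
import unicodedata

LIMIT = 140  # the limit discussed in this thread

def tweet_length(text: str) -> int:
    """Count code points after normalizing to NFC (assumed rule, see above)."""
    return len(unicodedata.normalize("NFC", text))

def fits(text: str) -> bool:
    """True if the normalized text is within the limit."""
    return tweet_length(text) <= LIMIT

# 140 copies of "e" + combining acute: 280 raw code points, 140 after NFC.
message = "e\u0301" * 140
print(len(message), tweet_length(message), fits(message))  # 280 140 True
```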

Hmm, that wasn't so succinct.

Twitter allows 140 codepoints of a sequence in normalization form C.
Wait, really? I thought normalization form C was the form where composite characters were always used when possible. Why restrict by the number of codepoints (vs characters) if you're explicitly going to use the form which goes out of its way to use multi-codepoint characters?

The only reason I can think of is that they internally use UTF-32, so counting codepoints is more efficient. But I thought they used UTF-8.

Edit: the other reason I can think of is that conversion to normalization form C already counts the codepoints. Though I can't imagine it would be hard to make it count characters as well.
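
On the UTF-32 vs. UTF-8 efficiency point, a small Python sketch of what counting codepoints looks like in each case. The function names are invented for this example, and both functions assume the input bytes are already valid in the given encoding.

```python
def count_code_points_utf8(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes.

    Continuation bytes look like 10xxxxxx; every other byte starts a code point.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)

def count_code_points_utf32(data: bytes) -> int:
    """Count code points in BOM-less UTF-32: always 4 bytes each."""
    return len(data) // 4

text = "caf\u00e9 \ud55c \U0001F600"  # 1-, 2-, 3-, and 4-byte UTF-8 sequences
print(count_code_points_utf8(text.encode("utf-8")))       # 8
print(count_code_points_utf32(text.encode("utf-32-le")))  # 8
```

UTF-32 gets the count from the length alone, while UTF-8 needs one pass over the bytes, but that pass is a single bit test per byte.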

Most likely because Twitter is a microblog, and 140-character posts are its thing.

I'm sure it had some historical technical limitation to 140 ASCII or maybe 70 utf-8 chars (or something else logical), but they probably had to accommodate people who wanted to use non-English characters in a post without giving them a lecture on Unicode encoding, or some slightly offensive "so ... people like you only get 70 chars" message.

The reason 140 was chosen had to do with SMS limits (160 chars). A 140 character limit left enough slack to allow metadata to be sent.
SMS has a limit of 140 bytes, not 160 characters.

In non-English languages, you are frequently restricted to 70 characters per SMS because a character doesn't fit in the 7-bit GSM alphabet[0]. When that happens, the handset has to switch to UTF-16 (2 bytes per BMP character, 4 for non-BMP characters).

[0] http://en.wikipedia.org/wiki/GSM_03.38
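
The arithmetic behind those numbers, as a quick Python sketch (the constant and function names are just for illustration):

```python
SMS_PAYLOAD_BYTES = 140  # payload size of a single SMS, per the comment above

def max_chars(bits_per_char: int) -> int:
    """How many fixed-width characters fit in one SMS payload."""
    return SMS_PAYLOAD_BYTES * 8 // bits_per_char

print(max_chars(7))   # 160 (7-bit GSM alphabet)
print(max_chars(16))  # 70  (2 bytes per BMP character)
```

So 160 characters and 140 bytes are the same limit seen through the 7-bit alphabet, and 70 is that limit at 2 bytes per character.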

You misunderstand. I'm wondering why they allow only 140 codepoints and not characters. Even in Unicode, 1 codepoint != 1 character.
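
To illustrate "1 codepoint != 1 character", here is a small Python sketch with a flag emoji, which is two codepoints that are normally drawn as one character. NFC has nothing to compose here, so the count stays at 2. As far as I know the Python standard library has no grapheme ("user-perceived character") counter, so this only shows the codepoint side.

```python
import unicodedata

# Regional indicator symbols F + R, usually rendered as a single flag.
flag = "\U0001F1EB\U0001F1F7"
nfc_flag = unicodedata.normalize("NFC", flag)

print(len(flag), len(nfc_flag))  # 2 2 (one visible character, two codepoints)
```
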
Sorry, my Unicode is a bit weak.

I think Twitter should use whichever usually gives the user the most characters, to prevent them from getting burnt.

I think in many cases, the normalized form is more permissive, as it puts "character plus diacritic" together into one character. In a language with lots of diacritics, the number of codepoints might be more than the number of normalized characters (depending on the client). You wouldn't want to allow (say) ~70 Korean characters on one OS, and 140 on another, just because they use different codepoints to represent certain characters - one with character then diacritic (2 codepoints?), another with both crammed together in one codepoint.

But as I said, I'm not a unicode guru (and I don't know much about Korean, I just saw it as an example). This might be wrong.
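
A concrete Korean case in Python, for what it's worth. One Hangul syllable can arrive either precomposed or as separate jamo, and NFC brings both to the same count. The variable names are mine, and whether any real client actually sends the decomposed form is an assumption.

```python
import unicodedata

precomposed = "\ud55c"  # the syllable 한 as a single codepoint (U+D55C)
as_jamo = unicodedata.normalize("NFD", precomposed)  # three conjoining jamo

print(len(precomposed), len(as_jamo))  # 1 3

# After NFC both spellings are identical, so the count is the same.
print(len(unicodedata.normalize("NFC", as_jamo)))  # 1
```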

No, you're absolutely right that counting by codepoints penalizes some languages. The combining marks I know best are simple, European ones: accents, for example. The normalization Twitter uses according to other posters in this thread (the recommended one from the Unicode standards organization, Form C) always uses the multi-codepoint form when possible, for compatibility reasons. That's why it baffles me if they count by codepoint and not character!
> Wait, really? I thought normalization form C was the form where composite characters were always used when possible.

NFC is the form where composed codepoints are used the most, hence the "smallest" form post-normalization.

> Why restrict by the number of codepoints (vs characters) if you're explicitly going to use the form which goes out of its way to use multi-codepoint characters?

They're not, that would be NFD.

Well, crap. This is why Unicode is hard, folks. Thanks for the correction, masklinn.