by waffle_ss 5401 days ago
Here's the Twitter dev article that I used for designing my URL shortener: https://dev.twitter.com/docs/counting-characters#Twitter_Cha...

Basically, they use the Normalization Form C of Unicode normalization which counts code points, not UTF-8 bytes.

1 comment

A slight modification to your tldr (it's a little confusing to me as I'm not so familiar with Unicode):

Basically, they count code points, not UTF-8 bytes.

Before they count the code points, they normalize using Unicode Normalization Form C (NFC), which composes characters with their diacritics where possible, so the "é" in "café" counts as 1 code point. If they used NFD to normalize, "é" would be decomposed into 2 code points: "e" and a combining diacritic mark. If they didn't normalize at all, the count would be client-dependent, since one client might send the composed form and another the decomposed form.
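To make that concrete, here is a quick sketch using Python's standard unicodedata module. It only illustrates the NFC/NFD difference described above and is not Twitter's actual code; the variable names are mine.

```python
import unicodedata

# Two ways a client might send "café":
composed = "caf\u00e9"      # precomposed é (U+00E9)
decomposed = "cafe\u0301"   # plain e followed by a combining acute accent (U+0301)

# Without normalization the code point counts differ by client.
raw_lengths = (len(composed), len(decomposed))

# NFC composes, NFD decomposes, regardless of which form came in.
nfc_len = len(unicodedata.normalize("NFC", decomposed))
nfd_len = len(unicodedata.normalize("NFD", composed))

print(raw_lengths)       # (4, 5)
print(nfc_len, nfd_len)  # 4 5
```

In Python 3, len() on a str counts code points, which is why it works as the counter here.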

Normalization is distinct from encoding. A Unicode string can be normalized in several different ways (including NFC and NFD), which can change the actual code points. Each normalized form can then be encoded using several different encodings (e.g. UTF-8, UTF-32, or Latin-1 for the characters it covers). Normalization affects the number of code points, and consequently the number of bytes. Encoding (Unicode -> bytes) changes the number of bytes, but should not affect the number of code points: code points are not influenced by the encoding.
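A small Python sketch of the same point, assuming the NFC form of "café": the byte count changes with the encoding, while the code point count stays at 4. The encoding list is just my choice for illustration.

```python
import unicodedata

text = unicodedata.normalize("NFC", "cafe\u0301")
code_points = len(text)  # 4 code points after NFC

# Same code points, different byte counts per encoding.
# The "-le" variants are used so no byte order mark is added.
byte_lengths = {
    enc: len(text.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-le", "latin-1")
}
print(code_points)   # 4
print(byte_lengths)  # {'utf-8': 5, 'utf-16-le': 8, 'utf-32-le': 16, 'latin-1': 4}

# Decoding any of them gives back the same 4 code points.
round_trips = [len(text.encode(enc).decode(enc)) for enc in byte_lengths]
```

Note that Latin-1 only works here because "é" exists in it as a single character; the NFD form would fail to encode, since Latin-1 has no combining accent.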

Recapping - Twitter does not count bytes. They count code points, after the code points have been normalized (combining diacritics with characters), so message length should not depend on the client.

Hmm, that wasn't so succinct.