A Tweet is Worth (at Least) 140 Words With this Compression Algorithm

	A Tweet is Worth (at Least) 140 Words With this Compression Algorithm (thevirtuosi.blogspot.com)
	67 points by Umalu 5401 days ago

11 comments

waitwhat 5401 days ago

A hack for the sake of doing a hack.

If you really want to compress as many words as you want into a tweet, just include a link: http://www.example.com/really-long-article.txt

Nick_C 5400 days ago

Heh, I never realised IANA supports example.com and example.org. Did you all know that already and forget to tell me?

waffle_ss 5401 days ago

I think the most realistic way to compress a tweet would be to replace words like "before" with "b4", "too"/"to" with "2", reduce whitespace (e.g. double spaces to single), and maybe start ripping out vowels ("vowels" -> "vwls").

Although not as efficient as demonstrated above, there are no external dependencies needed; the content can be decompressed by the reader's brain in-place at the slight cost of being difficult to immediately parse/understand.
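
A minimal sketch of the substitutions described above, assuming a tiny hypothetical replacement table (`SUBSTITUTIONS`) and a naive vowel-stripping rule. It is lossy by design: there is no decompressor, the reader does that part.

```python
import re

# Hypothetical substitution table; only these entries come from the comment above.
SUBSTITUTIONS = {"before": "b4", "too": "2", "to": "2"}

def strip_vowels(word: str) -> str:
    """Drop vowels after the first letter, leaving short words untouched."""
    if len(word) <= 3:
        return word
    return word[0] + re.sub(r"[aeiouAEIOU]", "", word[1:])

def compress(text: str) -> str:
    """Lossy 'netspeak' compression: substitute, strip vowels, collapse whitespace."""
    words = text.split()  # split() with no arguments also collapses runs of whitespace
    return " ".join(SUBSTITUTIONS.get(w.lower()) or strip_vowels(w) for w in words)
```

Punctuation attached to a word (e.g. "before,") is not handled here; a real version would need to tokenize first.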

Jach 5401 days ago

There's also a social status cost, which could be great or small, for example if your readers have an intense dislike for netspeak and poor English.

FuzzyDunlop 5401 days ago

Here in the UK it's interesting to note that, while text speak was all the rage a few years ago, it faded out.

It was replaceddddd by making wordssss actuallllllly longerrrrrr for no reeeeaaaasonnnnn!!!

Now the text speak is a bit more reined in and not totally incomprehensible like it used to get.

hammock 5400 days ago

Actually making the words longer like that goes a long way to expressing emotion in your texts - while avoiding the use of emoticons, which make you look like a girl.

For example, adding letters to a word is especially useful when teasing someone: it gives them a hint you're not 100% serious.

andrewflnr 5401 days ago

I have observed that on occasion here in the US too.

rorrr 5401 days ago

And how would you uncompress it?

B4 bs ws 2 lt.

petercooper 5401 days ago

I wondered how a naive approach would work in comparison: You can reliably represent just over 2^20 codepoints in a UTF-8 character, i.e. about 20 bits per character, or 2800 bits over 140 characters. Standard ASCII is 2^7, i.e. 7 bits per character. 2800/7 gives us a potential 400 ASCII characters using a naive approach alone, or a compression of 2.86x compared to the 5x he mentions.
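
The arithmetic above can be sketched as a bit packer. This is only an illustration under stated assumptions: it maps each 20-bit value onto the supplementary planes starting at U+10000 (the `BASE` constant is my choice, not something from the article), and it ignores that some of those code points are noncharacters or unassigned and might be rejected by a real service.

```python
BITS_PER_CODE_POINT = 20
ASCII_BITS = 7
BASE = 0x10000  # assumption: use U+10000..U+10FFFF, which is exactly 2**20 code points

def pack(text: str) -> str:
    """Pack 7-bit ASCII characters into 20-bit code points."""
    bits, nbits = 0, 0
    for ch in text:
        code = ord(ch)
        if code >= 128:
            raise ValueError("only 7-bit ASCII is supported in this sketch")
        bits = (bits << ASCII_BITS) | code
        nbits += ASCII_BITS
    # Pad with zero bits up to a multiple of 20.
    pad = (-nbits) % BITS_PER_CODE_POINT
    bits <<= pad
    nbits += pad
    mask = (1 << BITS_PER_CODE_POINT) - 1
    return "".join(
        chr(BASE + ((bits >> shift) & mask))
        for shift in range(nbits - BITS_PER_CODE_POINT, -1, -BITS_PER_CODE_POINT)
    )

def unpack(packed: str) -> str:
    """Inverse of pack(); trailing NUL padding is stripped."""
    bits, nbits = 0, 0
    for ch in packed:
        bits = (bits << BITS_PER_CODE_POINT) | (ord(ch) - BASE)
        nbits += BITS_PER_CODE_POINT
    nchars = nbits // ASCII_BITS
    bits >>= nbits - nchars * ASCII_BITS  # drop leftover pad bits
    chars = [
        chr((bits >> shift) & 0x7F)
        for shift in range((nchars - 1) * ASCII_BITS, -1, -ASCII_BITS)
    ]
    return "".join(chars).rstrip("\x00")
```

With this scheme, 400 ASCII characters occupy 400 * 7 = 2800 bits, which is 140 code points. Text that itself ends in NUL characters would not round-trip, since padding is stripped on unpack.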

blauwbilgorgel 5401 days ago

Great hacking. If you like this topic, this should be relevant: http://stackoverflow.com/questions/891643/twitter-image-enco... (Compressing images in tweets)

instakill 5401 days ago

All that's missing is a custom Twitter client that compresses upon tweeting and decompresses in the various timeline columns, à la TweetDeck. It could make for something interesting. Too bad that would be against Twitter's TOS.

bmalicoat 5401 days ago

Good article, but isn't the author describing Huffman codes?

RodgerTheGreat 5401 days ago

Same basic idea. He's using variable-length input strings and mapping them to unique symbols rather than the other way around, but it seems like he's using entropy measurements to build an optimal trie.
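
For comparison, a conventional per-character Huffman code maps single symbols to variable-length bit strings, which is the reverse direction of what is described above. A minimal sketch, assuming symbol frequencies are taken from the input itself; `huffman_code` is a hypothetical helper and not the article's method:

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    """Return a {symbol: bit-string} prefix code built from symbol frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tiebreak, {symbol: code-so-far}); the unique
    # tiebreak keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

Frequent symbols end up with shorter codes, so the encoded bit length is at most that of a fixed-width encoding of the same alphabet.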

akavi 5401 days ago

Is an arbitrary bit sequence valid unicode? If so, I'm curious if a simple per-character Huffman encoding of English would be more efficient.

JeremyBanks 5401 days ago

No, it's not. (Although "unicode" is ambiguous, because there are several unicode encodings.) Even if the binary sequence can be decoded as unicode code points, certain combinations of code points aren't allowed (surrogate pairs have to be matched).
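
A quick way to see this, taking UTF-8 as the encoding in question and using Python's strict codec as the judge (the helper names are mine):

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the byte sequence decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def can_encode_utf8(text: str) -> bool:
    """True if every code point in text may be encoded as UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

Arbitrary bytes such as `b"\xff\xfe"` fail `is_valid_utf8`, a truncated sequence such as `b"\xc3"` fails too, and a lone surrogate such as `"\ud800"` fails `can_encode_utf8`.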

jconnop 5401 days ago

To extend upon this idea - if you really wanted to maximise the data you could transmit in a single Twitter message, you could use the full 31 bits of Unicode (instead of just the Chinese subset) and then apply standard lossless data compression techniques to the generated Unicode for further improvement.

waffle_ss 5401 days ago

There are a lot of code points that would need to be filtered out if you do this - Noncharacters, Control codes, High/Low surrogates, Private-Use, Whitespace, and then of course the ones that mutate other code points in the sequence - Bidirectional, Combining characters / diacritical marks. It isn't quite as simple as just combining random 32-bit characters, as I found when creating my URL shortener.

If you want to play around with this, there is a helpful official (but really hard to find) Web app here: http://unicode.org/cldr/utility/properties.html

The filter I ended up using is:

  [:Diacritic=No:]&[:Noncharacter_Code_Point=No:]&[:Deprecated=No:]&[:White_Space=No:]&[:General_Category=Math_Symbol:]|[:General_Category=Symbol:]|[:General_Category=Letter:]|[:General_Category=Punctuation:]|[:General_Category=Currency_Symbol:]|[:General_Category=Number:]&[:General_Category!=Modifier_Letter:]&[:General_Category!=Modifier_Symbol:]
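
A rough approximation of that kind of filter using Python's `unicodedata` module. This is not equivalent to the UnicodeSet expression above: it works only from general categories, the excluded set is my own guess at the classes listed, and the resulting count depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Assumed mapping of the classes named above onto general categories:
# surrogates, controls, format/bidi controls, private use, unassigned
# (which includes noncharacters), combining marks, separators, modifiers.
EXCLUDED_CATEGORIES = {
    "Cs", "Cc", "Cf", "Co", "Cn",
    "Mn", "Mc", "Me",
    "Zs", "Zl", "Zp",
    "Lm", "Sk",
}

def is_usable(cp: int) -> bool:
    """True if the code point looks safe to use as a standalone data symbol."""
    ch = chr(cp)
    return unicodedata.category(ch) not in EXCLUDED_CATEGORIES and not ch.isspace()

def count_usable() -> int:
    """Count usable code points across the whole Unicode code space."""
    return sum(1 for cp in range(0x110000) if is_usable(cp))
```

The base-2 logarithm of `count_usable()` gives an estimate of how many bits each filtered character could carry.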

masklinn 5401 days ago

> High/Low surrogates

Surrogates are not codepoints.

psykotic 5401 days ago

That's not right. Surrogates are code points. That's the whole idea! It means you can express characters beyond the BMP with legacy encodings that were designed back when the entire code space could be coded in 16 bits. Newer encodings like UTF-8 don't have to rely on surrogates to do this because they have that capability at the bytewise encoding level.

http://www.google.com/search?&q=site%3Aunicode.org+surro...

adgar 5401 days ago

Each of the pair is a single codepoint; both combine to make one character.
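
To make the pairing concrete, here is a small sketch of how UTF-16 splits a code point beyond the BMP into a high and a low surrogate (the function name is mine; the arithmetic is the standard UTF-16 rule):

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need surrogates")
    offset = cp - 0x10000             # 20 bits remain after removing the BMP offset
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low
```

For U+1F600 this yields 0xD83D and 0xDE00, which matches the bytes produced by encoding that character as UTF-16.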

tomotomo 5400 days ago

I made a quick Chrome extension (userscript wasn't going to have enough permissions) for this which is up at https://chrome.google.com/webstore/detail/idcnolgflhcckjdfpf...

waffle_ss 5401 days ago

Unicode is really fun. In a similar vein, my first Rails app was a URL shortener that also takes advantage of Twitter's Unicode character counting method:

http://menosgrande.org

cpeterso 5401 days ago

So Twitter allows 140 (UTF-8?) characters, regardless of the number of bytes? The article wasn't clear about this.

waffle_ss 5401 days ago

Here's the Twitter dev article that I used for designing my URL shortener: https://dev.twitter.com/docs/counting-characters#Twitter_Cha...

Basically, they use the Normalization Form C of Unicode normalization which counts code points, not UTF-8 bytes.

wisty 5401 days ago

A slight modification to your tldr (it's a little confusing to me as I'm not so familiar with unicode):

Basically, they count code points, not UTF-8 bytes.

Before they count the code points, they normalize using the Normalization Form C of Unicode (NFC), which aims to combine diacritics, so the "é" in "café" counts as 1 code point. If they used NFD to normalize, "é" would be normalized into 2 code points - "e" and a diacritic mark. If they didn't normalize, then it would be client dependent.

Normalization is distinct from encoding. A unicode string can be normalized in several different ways (including NFC and NFD), which changes the actual unicode code points. Each normalized form can be encoded using several different encodings (e.g. utf-8, utf-32, latin1, etc). Normalization affects both the number of code points, and subsequently the number of bytes. Encoding (unicode -> bytes) changes the number of bytes, but should not affect the number of code points - code points should not be influenced by the encoding.

Recapping - Twitter does not count bytes. They count code points, after the code points have been normalized (combining diacritics with characters), so message length should not depend on the client.

Hmm, that wasn't so succinct.
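
The "café" example can be checked with Python's `unicodedata`. This is a sketch of the counting rule as described in this thread, not Twitter's actual implementation; `code_point_length` is a made-up name.

```python
import unicodedata

def code_point_length(text: str, form: str = "NFC") -> int:
    """Number of code points after normalizing text to the given form."""
    return len(unicodedata.normalize(form, text))

# "café" typed as "e" followed by a combining acute accent (U+0301).
decomposed = "cafe\u0301"

nfc_length = code_point_length(decomposed, "NFC")   # 4: "é" becomes one code point
nfd_length = code_point_length(decomposed, "NFD")   # 5: "e" plus the combining mark

# Encoding changes the byte count but not the code point count.
nfc_text = unicodedata.normalize("NFC", decomposed)
utf8_bytes = len(nfc_text.encode("utf-8"))       # 5 bytes
utf32_bytes = len(nfc_text.encode("utf-32-le"))  # 16 bytes
```

Whichever way a client composes the "é", normalizing to NFC first gives the same count of 4.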

masklinn 5401 days ago

Twitter allows 140 codepoints of a sequence in normalization form C.

adgar 5401 days ago

Wait, really? I thought normalization form C was the form where composite characters were always used when possible. Why restrict by the number of codepoints (vs characters) if you're explicitly going to use the form which goes out of its way to use multi-codepoint characters?

The only reason I can think of is that they internally use UTF-32, so counting codepoints is more efficient. But I thought they used UTF-8.

Edit: the other reason I can think of is that conversion to normalization form C already counts the codepoints. Though I can't imagine that making it also count characters would be difficult.

wisty 5401 days ago

Most likely, because Twitter is a microblog, and 140-character posts are its thing.

I'm sure it had some historical technical limitation to 140 ASCII or maybe 70 utf-8 chars (or something else logical), but they probably had to accommodate people who wanted to use non-English characters in a post and not get a lecture on unicode encoding; and some slightly offensive "so ... people like you only get 70 chars" message.

SoftwareMaven 5401 days ago

The reason 140 was chosen had to do with SMS limits (160 chars). A 140 character limit left enough slack to allow metadata to be sent.

masklinn 5401 days ago

SMS has a limit of 140 bytes, not 160 characters.

In non-English languages, messages are frequently restricted to 70 characters per SMS, because a single character that does not fit in the 7-bit GSM alphabet[0] forces the handset to switch the whole message to UTF-16 (2 bytes per BMP character, 4 for non-BMP characters).

[0] http://en.wikipedia.org/wiki/GSM_03.38
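
The 160 and 70 figures both fall out of the 140-byte payload. A small sketch of the arithmetic (constant and function names are mine):

```python
SMS_PAYLOAD_BYTES = 140  # the payload size stated above

def sms_capacity(bits_per_char: int) -> int:
    """How many fixed-width characters fit in a single SMS payload."""
    return SMS_PAYLOAD_BYTES * 8 // bits_per_char

gsm_7bit_chars = sms_capacity(7)    # 1120 bits / 7  = 160 characters
utf16_bmp_chars = sms_capacity(16)  # 1120 bits / 16 = 70 characters
```

So "160 characters" and "140 bytes" describe the same limit when every character fits in the 7-bit alphabet.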

adgar 5401 days ago

You misunderstand. I'm wondering why they allow only 140 codepoints and not characters. Even in Unicode, 1 codepoint != 1 character.

wisty 5401 days ago

Sorry, my Unicode is a bit weak.

I think Twitter should use whichever usually gives the user the most characters, to prevent them from getting burnt.

I think in many cases, the normalized form is more permissive, as it puts "character plus diacritic" together into one character. In a language with lots of diacritics, the number of codepoints might be more than the number of normalized characters (depending on the client). You wouldn't want to allow (say) ~70 Korean characters on one OS, and 140 on another, just because they use different codepoints to represent certain characters - one with character then diacritic (2 codepoints?), another with both crammed together in one codepoint.

But as I said, I'm not a unicode guru (and I don't know much about Korean, I just saw it as an example). This might be wrong.

masklinn 5401 days ago

> Wait, really? I thought normalization form C was the form where composite characters were always used when possible.

NFC is the form where composed codepoints are used the most, hence the "smallest" form post-normalization.

> Why restrict by the number of codepoints (vs characters) if you're explicitly going to use the form which goes out of its way to use multi-codepoint characters?

They're not, that would be NFD.

adgar 5401 days ago

Well, crap. This is why Unicode is hard, folks. Thanks for the correction, masklinn.

heydenberk 5401 days ago

Would it be possible to use the tweet metadata to cram more data into a tweet?

MostAwesomeDude 5401 days ago

Yes! You can put whatever you want in the location field, including completely invalid locations. I remember a discussion last year about using those bytes for yet more space in a tweet, if your client software is looking for them.
