| HN Mirror



	by jconnop 5401 days ago
	To extend this idea: if you really wanted to maximise the data you could transmit in a single Twitter message, you could use the full 31 bits of Unicode (instead of just the Chinese subset) and then apply standard lossless data compression techniques to the generated Unicode for further improvement.
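A minimal sketch of the packing idea, assuming a hypothetical alphabet of 20,000 consecutive CJK ideographs starting at U+4E00 (the names `build_alphabet`, `encode`, and `decode` are made up for illustration, not anyone's actual implementation). It treats the payload as one big integer and rewrites it in base N, where N is the alphabet size:

```python
def build_alphabet(start=0x4E00, size=20000):
    # Hypothetical alphabet: a contiguous run inside CJK Unified Ideographs.
    return [chr(cp) for cp in range(start, start + size)]


def encode(data: bytes, alphabet: list) -> str:
    base = len(alphabet)
    # Prefix a 0x01 sentinel byte so leading zero bytes survive the
    # round trip through an integer.
    n = int.from_bytes(b"\x01" + data, "big")
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(alphabet[r])
    return "".join(reversed(digits))


def decode(text: str, alphabet: list) -> bytes:
    index = {ch: i for i, ch in enumerate(alphabet)}
    base = len(alphabet)
    n = 0
    for ch in text:
        n = n * base + index[ch]
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the sentinel byte


if __name__ == "__main__":
    alphabet = build_alphabet()
    packed = encode(b"hello, world", alphabet)
    print(len(packed), decode(packed, alphabet))
```

With 20,000 symbols each character carries a little over 14 bits, so a larger alphabet buys more bits per character. Compression would normally be applied to the bytes before encoding, not to the generated text.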


waffle_ss 5401 days ago

There are a lot of code points that would need to be filtered out if you do this - Noncharacters, Control codes, High/Low surrogates, Private-Use, Whitespace, and then of course the ones that mutate other code points in the sequence - Bidirectional, Combining characters / diacritical marks. It isn't quite as simple as just combining random 32-bit characters, as I found when creating my URL shortener.

If you want to play around with this, there is a helpful official (but really hard to find) Web app here: http://unicode.org/cldr/utility/properties.html

The filter I ended up using is:

  [:Diacritic=No:]&[:Noncharacter_Code_Point=No:]&[:Deprecated=No:]&[:White_Space=No:]&[:General_Category=Math_Symbol:]|[:General_Category=Symbol:]|[:General_Category=Letter:]|[:General_Category=Punctuation:]|[:General_Category=Currency_Symbol:]|[:General_Category=Number:]&[:General_Category!=Modifier_Letter:]&[:General_Category!=Modifier_Symbol:]
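For anyone without the web app handy, a rough approximation of this kind of filter can be sketched with Python's standard `unicodedata` module. This is not equivalent to the UnicodeSet expression above: `unicodedata` does not expose the `Diacritic` or `Deprecated` properties, so the sketch relies on General_Category alone, and the helper name `is_usable` is made up for illustration.

```python
import unicodedata

# Major General_Category classes to keep: Letter, Number, Punctuation, Symbol.
KEEP_MAJOR = {"L", "N", "P", "S"}
# Subcategories to drop: Modifier_Letter and Modifier_Symbol.
DROP = {"Lm", "Sk"}


def is_usable(cp: int) -> bool:
    """Return True if the code point falls in a kept General_Category.

    Rejecting everything outside L/N/P/S also rejects marks (M*),
    separators (Z*), and the C* classes: controls, format characters
    such as bidi controls, surrogates, private use, and unassigned
    code points (which include noncharacters).
    """
    cat = unicodedata.category(chr(cp))
    return cat[0] in KEEP_MAJOR and cat not in DROP


if __name__ == "__main__":
    usable = [cp for cp in range(0x110000) if is_usable(cp)]
    print(len(usable), "usable code points in this Python's Unicode tables")
```

The count printed depends on the Unicode version bundled with the interpreter, so it will differ between Python releases.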


masklinn 5401 days ago

> High/Low surrogates

Surrogates are not codepoints.


psykotic 5401 days ago

That's not right. Surrogates are code points. That's the whole idea! It means you can express characters beyond the BMP with legacy encodings that were designed back when the entire code space could be coded in 16 bits. Newer encodings like UTF-8 don't have to rely on surrogates to do this because they have that capability at the bytewise encoding level.

http://www.google.com/search?&q=site%3Aunicode.org+surro...


adgar 5401 days ago

Each of the pair is a single codepoint; both combine to make one character.
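To make the pairing concrete, here is a small sketch of the standard UTF-16 arithmetic that maps one supplementary code point to a high/low surrogate pair and back. The function names are made up for illustration; the constants are the ones defined by UTF-16.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above the BMP into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    v = cp - 0x10000                 # 20-bit offset into the supplementary planes
    high = 0xD800 + (v >> 10)        # top 10 bits
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x1F600)
    print(hex(high), hex(low))
    # Cross-check against Python's own UTF-16 encoder.
    print("\U0001F600".encode("utf-16-be").hex())
```

Both values in the pair sit in the reserved range U+D800 to U+DFFF, which is why a filter like the one above has to skip that range when picking characters directly.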
