| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by waffle_ss 5401 days ago

There are a lot of code points that would need to be filtered out if you do this - Noncharacters, Control codes, High/Low surrogates, Private-Use, Whitespace, and then of course the ones that mutate other code points in the sequence - Bidirectional, Combining characters / diacritical marks. It isn't quite as simple as just combining random 32-bit characters, as I found when creating my URL shortener.

If you want to play around with this, there is a helpful official (but really hard to find) Web app here: http://unicode.org/cldr/utility/properties.html

The filter I ended up using is:

  [:Diacritic=No:]&[:Noncharacter_Code_Point=No:]&[:Deprecated=No:]&[:White_Space=No:]&[:General_Category=Math_Symbol:]|[:General_Category=Symbol:]|[:General_Category=Letter:]|[:General_Category=Punctuation:]|[:General_Category=Currency_Symbol:]|[:General_Category=Number:]&[:General_Category!=Modifier_Letter:]&[:General_Category!=Modifier_Symbol:]

1 comments

masklinn 5401 days ago

> High/Low surrogates

Surrogates are not codepoints.

link

psykotic 5401 days ago

That's not right. Surrogates are code points. That's the whole idea! It means you can express characters beyond the BMP with legacy encodings that were designed back when the entire code space could be coded in 16 bits. Newer encodings like UTF-8 don't have to rely on surrogates to do this because they have that capability at the bytewise encoding level.

http://www.google.com/search?&q=site%3Aunicode.org+surro...

link

adgar 5401 days ago

Each of the pair is a single codepoint; both combine to make one character.

link