
by taejo 4500 days ago

The Unicode compatibility normalization forms (NFKC and NFKD) turn fullwidth Latin characters into ordinary Latin characters. [Wikipedia](https://en.wikipedia.org/wiki/Unicode_equivalence) says:

> In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ﬃ), roman numerals like U+2168 (Ⅸ) and even subscripts and superscripts, e.g. U+2075 (⁵) have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03. Likewise when searching for the Latin letter I (U+0049) in the precomposed Roman Numeral Ⅸ (U+2168). Similarly the superscript "⁵" (U+2075) is transformed to "5" (U+0035) by compatibility mapping.

Any good Unicode library should support normalization. For example, in Python:

   >>> import unicodedata
   >>> unicodedata.normalize('NFKD', u'ｆｕｌｌｗｉｄｔｈ－ｃｏｎｖｅｒｔｅｒ')
   u'fullwidth-converter'
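
To make the quoted passage concrete, here is a rough sketch (Python 3) that runs the three code points Wikipedia mentions through both NFC and NFKC and then tries the substring searches it describes. The `contains` helper is just a name I made up for illustration; it is not part of `unicodedata`.

```python
import unicodedata

# Code points mentioned in the quoted Wikipedia passage.
samples = {
    "ffi ligature": "\ufb03",
    "roman numeral nine": "\u2168",
    "superscript five": "\u2075",
}

def contains(haystack, needle, form):
    """Hypothetical helper: normalize both strings to `form`, then substring-test."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)

for name, ch in samples.items():
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    print(f"{name}: NFC={nfc!r} NFKC={nfkc!r}")

# Searching for a plain "f" only succeeds under compatibility normalization.
print(contains("\ufb03", "f", "NFC"))   # False
print(contains("\ufb03", "f", "NFKC"))  # True
```

The important bit is that both the haystack and the needle go through the same form, which is the "same for all strings involved" condition from the quote.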