by taejo
4453 days ago
The Unicode compatibility mappings (NFKC and NFKD) turn fullwidth Latin characters into ordinary Latin characters. [Wikipedia](https://en.wikipedia.org/wiki/Unicode_equivalence) says:

> In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ﬃ), roman numerals like U+2168 (Ⅸ) and even subscripts and superscripts, e.g. U+2075 (⁵) have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03. Likewise when searching for the Latin letter I (U+0049) in the precomposed Roman Numeral Ⅸ (U+2168). Similarly the superscript "⁵" (U+2075) is transformed to "5" (U+0035) by compatibility mapping.

Any good Unicode library should support normalization. For example, in Python:

```python
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ｆｕｌｌｗｉｄｔｈ－ｃｏｎｖｅｒｔｅｒ')
u'fullwidth-converter'
```
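
The search behaviour described in the Wikipedia passage can be checked the same way. This is only a rough sketch, assuming Python 3 (where every `str` is Unicode); `contains_after` is a hypothetical helper name made up for the illustration, not part of `unicodedata`:

```python
import unicodedata


def contains_after(form, haystack, needle):
    """Hypothetical helper: substring test after normalizing both strings to `form`."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)


ligature = "\uFB03"          # LATIN SMALL LIGATURE FFI
roman_nine = "\u2168"        # ROMAN NUMERAL NINE
superscript_five = "\u2075"  # SUPERSCRIPT FIVE

# Canonical normalization leaves the ligature alone, so "f" is not found.
print(contains_after("NFC", ligature, "f"))    # False
# Compatibility normalization expands it to "ffi", so "f" is found.
print(contains_after("NFKC", ligature, "f"))   # True
# The Roman numeral expands to the Latin letters "IX".
print(contains_after("NFKC", roman_nine, "I")) # True
# The superscript maps to the plain digit.
print(unicodedata.normalize("NFKC", superscript_five))  # 5
```

Normalizing both the text and the search term with the same form is the point: which form you pick decides whether these lookalike characters count as matches.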