by Toxygene
645 days ago
Another option would be to detect and/or normalize Unicode input using the recommendations from the Unicode Consortium: https://www.unicode.org/reports/tr39/

Here's the relevant bit from the doc:

> For an input string X, define skeleton(X) to be the following transformation on the string:
>
> 1. Convert X to NFD format, as described in [UAX15].
> 2. Remove any characters in X that have the property Default_Ignorable_Code_Point.
> 3. Concatenate the prototypes for each character in X according to the specified data, producing a string of exemplar characters.
> 4. Reapply NFD.
>
> The strings X and Y are defined to be confusable if and only if skeleton(X) = skeleton(Y). This is abbreviated as X ≅ Y.

This is obviously talking about comparing two strings to see if they are "confusable", but if you just run the skeleton function on a string, you get a "normalized" version of it.
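
A rough sketch of that skeleton transformation in Python, to make the steps concrete. The `PROTOTYPES` table and `DEFAULT_IGNORABLE` set here are tiny hand-picked stand-ins (my assumption, not the real data): a real implementation would load the full confusables data file published with TR39 and the full Default_Ignorable_Code_Point property, neither of which Python's `unicodedata` module exposes.

```python
import unicodedata

# Hypothetical, hand-picked subset of the confusables mapping.
# I am assuming these entries match the published data; a real
# implementation would parse the whole file instead.
PROTOTYPES = {
    "\u2010": "-",  # HYPHEN -> HYPHEN-MINUS
    "\u0430": "a",  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
}

# Partial stand-in for the Default_Ignorable_Code_Point property.
DEFAULT_IGNORABLE = {"\u00ad", "\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def skeleton(text):
    """Approximate the quoted skeleton(X) steps with the toy tables above."""
    text = unicodedata.normalize("NFD", text)                         # convert to NFD
    text = "".join(ch for ch in text if ch not in DEFAULT_IGNORABLE)  # drop ignorables
    text = "".join(PROTOTYPES.get(ch, ch) for ch in text)             # map to prototypes
    return unicodedata.normalize("NFD", text)                         # reapply NFD


def confusable(x, y):
    """X and Y are confusable iff their skeletons are equal."""
    return skeleton(x) == skeleton(y)


print(skeleton("foo\u2010bar") == "foo-bar")
print(confusable("p\u0430y", "pay"))
```

Note that the skeleton is meant for comparison, so using it as a "normalized" display form is a repurposing of it rather than what the report describes.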
|
But in fact, none of the Unicode normalization forms seem to convert a `HYPHEN` to a `HYPHEN-MINUS`. Try it, you'll see!
Unicode considers them semantically different characters, so normalization leaves them distinct.
The canonical normalization forms NFC and NFD, which are probably the defaults for a "unicode normalize" function, should always result in exactly equivalent glyphs (displayed the same by a given font, modulo bugs), just expressed differently in Unicode. For example, the single code point "Latin Small Letter E with Acute" (composed, NFC form) versus the two code points "Latin Small Letter E" plus "Combining Acute Accent" (decomposed, NFD form). I would not expect them to change the hyphen characters here, and they do not.
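
If you want to try the composed/decomposed case yourself, here is a small Python check using the standard `unicodedata` module. The helper name `canonical_forms` is just something I made up for the example.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def canonical_forms(text):
    """Return the (NFC, NFD) forms of text."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFD", text),
    )


# Both spellings of "é" end up with the same NFC and NFD forms.
print(canonical_forms(composed) == canonical_forms(decomposed))

# HYPHEN (U+2010) comes back unchanged under both canonical forms.
hyphen = "\u2010"
print(canonical_forms(hyphen) == (hyphen, hyphen))
```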
The "compatibility" normalizations, abbreviated with "K" since "C" was already taken for "composed", WILL change glyphs. For instance, they will normalize a "Superscript One" `¹` or a "Circled Digit One" `①` to an ordinary "Digit One" (ASCII 49). (That could also be relevant to this problem, and it's important that all platforms expose compatibility normalization too!) The forms are NFKC for compatibility plus composed, and NFKD for compatibility plus decomposed. I expected/hoped they would change the Unicode `HYPHEN` to the ASCII `HYPHEN-MINUS` here.
But they don't seem to: the Unicode Consortium decided these were not semantically equivalent even at the "compatibility" level.
Unfortunately! I was hoping compatibility normalization would solve it, but the standard Unicode normalization forms will not resolve this problem.
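
Here is a quick way to see the compatibility behavior in Python, again with the standard `unicodedata` module. The `compat` helper is a hypothetical name for this example, and it defaults to NFKC.

```python
import unicodedata


def compat(text, form="NFKC"):
    """Apply a compatibility normalization form (NFKC or NFKD) to text."""
    return unicodedata.normalize(form, text)


# SUPERSCRIPT ONE, CIRCLED DIGIT ONE, HYPHEN
for ch in ("\u00b9", "\u2460", "\u2010"):
    folded = compat(ch)
    code_points = " ".join(f"U+{ord(c):04X}" for c in folded)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {code_points}")
```

The first two fold to U+0031 (the plain digit), while HYPHEN stays U+2010, so an explicit mapping is still needed if you want it treated as `HYPHEN-MINUS`.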
(I forget whether there are locale-specific compatibility normalizations. If so, maybe they would normalize this? I think of compat normalization as usually being about "should this match in search results" (sure, you want `1` to match `①`), which can definitely be locale-specific.)