| This was my first thought. I was specifically thinking the less commonly used "compatibility" (K) normalization forms would do it. But in fact, none of the Unicode normalization forms convert a `HYPHEN` (U+2010) to a `HYPHEN-MINUS` (U+002D). Try it, you'll see! Unicode considers them semantically different characters, so normalization leaves them alone. The canonical forms NFC and NFD, which are probably the defaults for a "unicode normalize" function, should always result in exactly equivalent glyphs (displayed the same by a given font, modulo bugs), just expressed differently in code points. For example, the single code point "Latin Small Letter E with Acute" (composed, NFC form) vs. the two code points "Latin Small Letter E" plus "Combining Acute Accent" (decomposed, NFD form). I would not expect them to change the hyphen characters here, and they do not. The "compatibility" normalizations, abbreviated "K" since "C" was already taken for "composed", WILL change glyphs. For instance, they will normalize a "Superscript One" `¹` or a "Circled Digit One" `①` to an ordinary "Digit One" `1` (ASCII 49). (That could also be relevant to this problem, and it's important that all platforms expose compatibility normalization too!) They come as NFKC, for compatibility plus composed, and NFKD, for compatibility plus decomposed. I expected/hoped they would change the Unicode `HYPHEN` to the ASCII `HYPHEN-MINUS` here. But they don't: the Unicode Consortium decided these were not equivalent even at the "compatibility" level. Unfortunately! I was hoping compatibility normalization would solve it, but the standard Unicode normalization forms will not resolve this problem. (I forget if there are some locale-specific compatibility normalizations? And if so, maybe they would normalize this? I think of compat normalization as usually being like "for search results, should it match?" (sure, you want `1` to match `①`), which can definitely be locale-specific.) |
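If you want to try it yourself, here is a minimal sketch in Python using the standard library's `unicodedata.normalize`. The helper name `normalize_all` is made up for illustration; everything else is stdlib. It only checks the specific characters mentioned above, so treat it as a quick experiment rather than proof about every hyphen-like character.

```python
import unicodedata

HYPHEN = "\u2010"        # HYPHEN
HYPHEN_MINUS = "\u002d"  # HYPHEN-MINUS (ASCII 45)

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_all(text):
    """Hypothetical helper: apply each of the four normalization forms to text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


# None of the four forms should turn HYPHEN into HYPHEN-MINUS.
for form, result in normalize_all(HYPHEN).items():
    print(form, "unchanged:", result == HYPHEN,
          "became hyphen-minus:", result == HYPHEN_MINUS)

# Canonical forms only re-express the same character in different code points.
composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# Compatibility forms do change glyphs: SUPERSCRIPT ONE and CIRCLED DIGIT ONE
# both fold to an ordinary DIGIT ONE.
print(unicodedata.normalize("NFKC", "\u00b9\u2460") == "11")
```

So if you need `HYPHEN` and `HYPHEN-MINUS` to compare equal, normalization alone won't do it and you would need some extra mapping step on top.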
[1] https://www.unicode.org/Public/security/8.0.0/confusables.tx...