by Toxygene 645 days ago
Another option would be to detect and/or normalize Unicode input using the recommendations from the Unicode consortium.

https://www.unicode.org/reports/tr39/

Here's the relevant bit from the doc:

> For an input string X, define skeleton(X) to be the following transformation on the string:

    1. Convert X to NFD format, as described in [UAX15].
    2. Remove any characters in X that have the property Default_Ignorable_Code_Point.
    3. Concatenate the prototypes for each character in X according to the specified data, producing a string of exemplar characters.
    4. Reapply NFD.
The strings X and Y are defined to be confusable if and only if skeleton(X) = skeleton(Y). This is abbreviated as X ≅ Y.

This is obviously talking about comparing two strings to see if they are "confusable", but if you just run the skeleton function on a single string, you get a "normalized" version of it.
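A rough Python sketch of the quoted steps, in case it helps. The `PROTOTYPES` and `DEFAULT_IGNORABLE` tables here are tiny hand-picked stand-ins I made up for illustration; a real implementation would load the full confusables data and the actual Default_Ignorable_Code_Point property. The function names `skeleton` and `confusable` are likewise just my own labels.

```python
import unicodedata

# Hypothetical, hand-picked subset of the confusables data (illustration only).
PROTOTYPES = {
    "\u2010": "-",  # HYPHEN -> HYPHEN-MINUS
    "\u0430": "a",  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
}

# Tiny stand-in for Default_Ignorable_Code_Point (the real property is larger).
DEFAULT_IGNORABLE = {"\u00ad", "\u200b", "\u200c", "\u200d"}


def skeleton(text):
    """Approximate the TR39 skeleton transformation with the toy tables above."""
    # Convert to NFD.
    text = unicodedata.normalize("NFD", text)
    # Remove default-ignorable characters.
    text = "".join(ch for ch in text if ch not in DEFAULT_IGNORABLE)
    # Replace each character with its prototype, if it has one.
    text = "".join(PROTOTYPES.get(ch, ch) for ch in text)
    # Reapply NFD.
    return unicodedata.normalize("NFD", text)


def confusable(x, y):
    """X and Y are confusable if and only if their skeletons are equal."""
    return skeleton(x) == skeleton(y)
```

With these toy tables, `skeleton("well\u2010known")` comes back with an ASCII hyphen-minus, and `confusable("p\u0430ypal", "paypal")` is true.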

2 comments

This was my first thought -- I was specifically thinking the less typically used [K] "compatibility" normalization forms would do it.

But in fact, none of the unicode normalization forms seem to convert a `HYPHEN` to a `HYPHEN-MINUS`. Try it, you'll see!

Unicode considers them semantically different characters, and not normalized.

The default normalization forms NFC and NFD, which are probably the defaults for a "unicode normalize" function, should always result in exactly equivalent glyphs (displayed the same by a given font modulo bugs), just expressed differently in unicode. Like single code point "Latin Small Letter E with Acute" (composed, NFC form) vs two code points "latin small letter e" plus "combining acute accent" (decomposed, NFD form). I would not expect them to change the hyphen characters here -- and they do not.
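To make the composed/decomposed distinction concrete, here is a small check using Python's standard `unicodedata.normalize`. The variable names are just for illustration.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# NFC composes, NFD decomposes; both represent the same text.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# HYPHEN (U+2010) is left alone by both canonical forms.
hyphen = "\u2010"
hyphen_unchanged = all(
    unicodedata.normalize(form, hyphen) == hyphen for form in ("NFC", "NFD")
)

print(len(nfc), len(nfd), hyphen_unchanged)
```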

The "compatibility" normalizations, abbreviated by "K" since "C" was already taken for "composed", WILL change glyphs. For instance, they will normalize a "Superscript One" `¹` or a "Circled Digit 1" `①` to an ordinary "Digit 1" (ascii 49). (which could also be relevant to this problem, and it's important all platforms expose compatibility normalization too!) NFKC for compatibility plus composed, or NFKD for compatibility plus decomposed. I expected/hoped they would change the unicode `HYPHEN` to the ascii `HYPHEN-MINUS` here.

But they don't seem to; the Unicode Consortium decided these were not semantically equivalent, even at the "compatibility" level.
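Anyone who wants to "try it" can do so with Python's `unicodedata.normalize`. The helper name `compat_forms` is just something I made up for this snippet.

```python
import unicodedata


def compat_forms(ch):
    """Return the (NFKC, NFKD) forms of a string."""
    return (
        unicodedata.normalize("NFKC", ch),
        unicodedata.normalize("NFKD", ch),
    )


superscript_one = "\u00b9"  # SUPERSCRIPT ONE
circled_one = "\u2460"      # CIRCLED DIGIT ONE
hyphen = "\u2010"           # HYPHEN

# Both compatibility forms fold the fancy digits down to an ordinary "1" ...
print(compat_forms(superscript_one), compat_forms(circled_one))

# ... but HYPHEN comes back unchanged, not as HYPHEN-MINUS.
print(compat_forms(hyphen) == (hyphen, hyphen))
```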

Unfortunately! I was hoping compatibility normalization would solve it too! The standard unicode normalization forms will not resolve this problem though.

(I forget if there are some locale-specific compatibility normalizations? And if so, maybe they would normalize this? I think of compat normalization as usually being like "for search results should it match" (sure you want `1` to match `①`), which can definitely be locale specific)

As you correctly observed, step one does not normalize 'HYPHEN' to 'HYPHEN-MINUS'. Instead, that occurs in step three, using the confusables data file [1].

[1] https://www.unicode.org/Public/security/8.0.0/confusables.tx...

Aha, thanks!

So, yeah, that technical report is about security, typically the potential problems of making a username or domain name or other identifier look like another.

While OP wasn't about security, it does sound like the mapping potentially has non-security uses too as in OP.

(The term "normalization" with regard to unicode usually means something else, specifically NFC, NFD, NFKC, or NFKD normalization from UAX#15, which makes this hard to talk about clearly. I'm not sure what word to use for this "confusables" mapping.)

I haven't actually seen this particular algorithm/mapping discussed before. I'm not sure if routines to perform the mapping are available on common languages/platforms (ruby, python, node, java) -- if someone knows how to do it with, say, Java ICU4J library, it would be useful to see an example.

The confusables.txt file provided does look like it would make it easy to implement the mapping algorithm. I might give it a stab in ruby.
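For what it's worth, here is a minimal Python sketch of reading that file into a lookup table. I'm assuming each data line has the layout `source ; target ; type # comment`, with code points written in hex; the two sample lines below are illustrative rather than copied from the file, and `parse_confusables` is a name I made up. The real file may also start with a byte order mark, which you would need to strip first.

```python
SAMPLE = """\
# Illustrative lines in the assumed layout: source ; target ; type # comment
2010 ; 002D ; MA # HYPHEN -> HYPHEN-MINUS
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
"""


def parse_confusables(lines):
    """Build a {source string: prototype string} table from data lines."""
    mapping = {}
    for line in lines:
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        # Each field is one or more space-separated hex code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


table = parse_confusables(SAMPLE.splitlines())
print(table)
```

The resulting dict can then be used for the "concatenate the prototypes" step of the skeleton algorithm.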

It's a bit confusing to think about which non-security contexts it's applicable to without removing semantics you'd want.

In fact, TR39 says "The strings skeleton(X) and skeleton(Y) are not intended for display, storage or transmission," so it's not totally clear whether they'd think it was a good idea to use it in OP's use case.

If anyone has seen any writing on, or has any thoughts on, how to approach thinking about what non-security use cases and contexts doing this international "confusables" mapping is appropriate vs loss of semantics, I'd love to see it! Like I'm trying to think of whether you might want to map down these "confusables" for search indexing; it also seems like in some cases, especially without locale-specific data, you might be losing semantics you want to keep by doing this.

Python even has a handy function for this: https://docs.python.org/3/library/unicodedata.html#unicodeda...