| Aha, thanks! So, yeah, that technical report is about security, specifically the problems that arise when a username, domain name, or other identifier is made to look like another. While OP wasn't about security, it does sound like the mapping potentially has non-security uses too, as in OP. (The term "normalization" with regard to Unicode usually means something else, specifically NFC, NFD, NFKC, or NFKD normalization from UAX #15, which makes this hard to talk about clearly; I'm not sure what word to use for this "confusables" mapping.) I haven't actually seen this particular algorithm/mapping discussed before. I'm not sure if routines to perform the mapping are available for common languages/platforms (ruby, python, node, java) -- if someone knows how to do it with, say, the Java ICU4J library, it would be useful to see an example. The confusables.txt file provided does look like it would make it easy to implement the mapping algorithm. I might take a stab at it in ruby. It's a bit confusing to think about which non-security contexts it's applicable in without removing semantics you'd want. In fact, TR39 says "The strings skeleton(X) and skeleton(Y) are not intended for display, storage or transmission," so it's not totally clear whether they'd think it was a good idea to use it for OP's use case. If anyone has seen any writing on, or has any thoughts on, how to think about which non-security use cases and contexts this international "confusables" mapping is appropriate for, versus where it loses semantics, I'd love to see it! For example, I'm trying to think through whether you might want to map down these "confusables" for search indexing; it also seems like in some cases, especially without locale-specific data, you might lose semantics you want to keep by doing this. |
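For what it's worth, here is a rough Python sketch of how the mapping could be implemented from a file in the confusables.txt format (`source ; target ; type # comment`, with code points in hex). It follows the basic shape of the TR39 skeleton operation (NFD, map each character to its prototype, NFD again) but skips the removal of default-ignorable code points and any bidi handling, so treat it as an illustration rather than a conforming implementation. The function names and the three-line `SAMPLE` table are made up for the example; a real run would read the full confusables.txt instead. ICU's SpoofChecker appears to cover the same ground for Java, though I haven't verified the details.

```python
import unicodedata


def parse_confusables(lines):
    """Parse lines in the confusables.txt format into a {source: target} dict."""
    table = {}
    for line in lines:
        # Drop a leading BOM and anything after '#', which starts a comment.
        line = line.replace("\ufeff", "").split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        # Each field is a space-separated sequence of hex code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


# Tiny illustrative subset (hypothetical stand-in for the real data file).
SAMPLE = """\
# sample data
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
043E ; 006F ; MA # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
0440 ; 0070 ; MA # CYRILLIC SMALL LETTER ER -> LATIN SMALL LETTER P
""".splitlines()


def skeleton(text, table):
    """Simplified skeleton: NFD, map each character, then NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(table.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(a, b, table):
    """Two strings are treated as confusable if their skeletons match."""
    return skeleton(a, table) == skeleton(b, table)


table = parse_confusables(SAMPLE)
print(skeleton("p\u0430ypal", table))              # prints "paypal"
print(confusable("p\u0430ypal", "paypal", table))  # prints True
```

Note that the result is only meant for comparison: two strings are "confusable" when their skeletons are equal, which fits the TR39 warning that skeletons are not for display or storage. For something like search indexing you would be storing the skeleton as a lookup key, which is exactly where the loss-of-semantics question comes in.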