by cryptonector 2733 days ago
If you're just comparing strings, do a character-at-a-time comparison: decompose one character at a time from each string (no need to recompose, and look ma', no allocation needed), compare the two decomposed characters' codepoints, then fail or move on to the next character. I call this form-insensitive string comparison.

Inventing your own pseudo-normalization of Unicode is a worse idea than using the actual normalization forms Unicode defines.

Also, if you think you can decompose without allocating memory... well, try a code point like U+FDFA.

For reference, its compatibility (NFKD) decomposition is:

U+0635 U+0644 U+0649 U+0020 U+0627 U+0644 U+0644 U+0647 U+0020 U+0639 U+0644 U+064A U+0647 U+0020 U+0648 U+0633 U+0644 U+0645
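
A quick way to check this expansion, as a sketch using Python's standard `unicodedata` module (the variable names are only for illustration). Note that the expansion comes from the compatibility mapping, so it shows up under NFKD; under the canonical form NFD the code point is left alone.

```python
import unicodedata

LIGATURE = "\ufdfa"  # ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM

# Compatibility decomposition: one code point expands to many.
nfkd = unicodedata.normalize("NFKD", LIGATURE)
# Canonical decomposition: U+FDFA has no canonical mapping.
nfd = unicodedata.normalize("NFD", LIGATURE)

nfkd_codepoints = ["U+%04X" % ord(c) for c in nfkd]

if __name__ == "__main__":
    print(len(nfkd), " ".join(nfkd_codepoints))
    print(len(nfd), " ".join("U+%04X" % ord(c) for c in nfd))
```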

(and that doesn't begin to touch any of the potential issues with variant forms, homoglyph attacks, etc.)

There's nothing pseudo about it. Normalizing both inputs first and then comparing is equivalent to normalizing one character at a time and comparing as you go. There is a maximum number of codepoints in a canonical decomposition (or at least there used to be).
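
Here is a minimal sketch of what a character-at-a-time comparison could look like in Python. This is not the ZFS code; `nfd_stream` and `form_insensitive_equal` are hypothetical names made up for the example. One detail it has to handle: decomposing each code point in isolation is not quite enough, because combining marks must also be put into canonical order, so the sketch keeps a small buffer holding only the current run of combining marks and flushes it at the next starter.

```python
import unicodedata
from itertools import zip_longest


def nfd_stream(s):
    """Yield the NFD code points of s without normalizing the whole string.

    Hypothetical helper: each input code point is decomposed on its own,
    and only the current run of combining marks is buffered so it can be
    put into canonical order.
    """
    pending = []  # non-starters (combining class != 0) awaiting reordering
    for ch in s:
        for cp in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(cp) == 0:
                # A starter ends the run: emit buffered marks in canonical
                # order (stable sort by combining class), then the starter.
                pending.sort(key=unicodedata.combining)
                yield from pending
                pending.clear()
                yield cp
            else:
                pending.append(cp)
    pending.sort(key=unicodedata.combining)
    yield from pending


def form_insensitive_equal(a, b):
    """Compare two strings by canonical equivalence, stopping at the
    first mismatching code point."""
    return all(
        x == y for x, y in zip_longest(nfd_stream(a), nfd_stream(b))
    )


if __name__ == "__main__":
    print(form_insensitive_equal("\u00e9", "e\u0301"))
    print(form_insensitive_equal("a\u0323\u0307", "a\u0307\u0323"))
    print(form_insensitive_equal("e", "\u00e9"))
```

The comparison never builds a normalized copy of either input and bails out at the first difference. The buffer is as long as the longest run of combining marks in the input, so a C version with a fixed-size buffer would need to decide what to do with pathological input.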

This is actually implemented in ZFS. (And also character-at-a-time normalization for hashing.)

I don't see how homoglyphs enter the picture. Can you explain?