
by ubernostrum 2737 days ago

Inventing your own pseudo-normalization of Unicode is a worse idea than using the actual normalization forms Unicode defines.

Also, if you think you can decompose without allocating memory... well, try a code point like U+FDFA.

For reference, its decomposition is:

U+0635 U+0644 U+0649 U+0020 U+0627 U+0644 U+0644 U+0647 U+0020 U+0639 U+0644 U+064A U+0647 U+0020 U+0648 U+0633 U+0644 U+0645

(and that doesn't begin to touch any of the potential issues with variant forms, homoglyph attacks, etc.)
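To make the U+FDFA expansion concrete, here is a minimal sketch using Python's standard `unicodedata` module (the variable names are illustrative, not from the thread). One detail worth noting: the mapping listed above is a compatibility decomposition (tagged `<isolated>` in the Unicode data), so it shows up under NFKD, while NFD leaves the code point unchanged.

```python
import unicodedata

ch = "\uFDFA"  # ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM

# Canonical decomposition (NFD) does not touch this code point ...
nfd = unicodedata.normalize("NFD", ch)

# ... but compatibility decomposition (NFKD) expands it considerably.
nfkd = unicodedata.normalize("NFKD", ch)

# Render the expanded form in the same U+XXXX notation used above.
expanded = " ".join(f"U+{ord(c):04X}" for c in nfkd)

print(len(nfd), len(nfkd))
print(expanded)
```

A single input code point turning into 18 output code points is the case that a fixed, small output buffer has to account for.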


cryptonector 2737 days ago

There's nothing pseudo about it. Normalizing both inputs first and then comparing is equivalent to normalizing one character at a time and comparing as you go. There is a maximum number of codepoints in a canonical decomposition (or at least there used to be).

This is actually implemented in ZFS. (And also character-at-a-time normalization for hashing.)
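As a rough illustration of the idea (not the ZFS code, which is not shown here), the sketch below compares two strings under canonical decomposition without building a fully normalized copy of either. The function names `nfd_stream` and `equal_normalized` are hypothetical. It buffers one segment at a time, where a segment begins at any character whose canonical decomposition starts with a combining-class-0 code point, so canonical reordering of combining marks never has to cross a segment boundary.

```python
import unicodedata
from itertools import zip_longest

def _starts_segment(ch):
    # A character opens a new segment if its canonical decomposition
    # begins with a starter (canonical combining class 0).
    first = unicodedata.normalize("NFD", ch)[0]
    return unicodedata.combining(first) == 0

def nfd_stream(s):
    """Yield the NFD code points of s, normalizing one segment at a time."""
    buf = ""
    for ch in s:
        if buf and _starts_segment(ch):
            # The buffered segment is complete: normalize and emit it.
            yield from unicodedata.normalize("NFD", buf)
            buf = ""
        buf += ch
    if buf:
        yield from unicodedata.normalize("NFD", buf)

def equal_normalized(a, b):
    """Compare a and b under NFD, stopping at the first mismatch."""
    sentinel = object()  # marks the end of the shorter stream
    pairs = zip_longest(nfd_stream(a), nfd_stream(b), fillvalue=sentinel)
    return all(x == y for x, y in pairs)

print(equal_normalized("\u00e9", "e\u0301"))
print(equal_normalized("a\u0323\u0301", "a\u0301\u0323"))
print(equal_normalized("abc", "abd"))
```

The buffer here is only as large as the longest run of combining marks following a starter, which is small for ordinary text but unbounded for adversarial input; a real implementation would need a cap or a fallback for that case.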

I don't see how homoglyphs enter the picture. Can you explain?
