by remram 1390 days ago
Some Unicode characters can be represented with different sequences of Unicode codepoints. For example é can be a single codepoint U+00E9 "latin small letter E with acute" or it can be the two codepoints U+0065 "latin small letter E" and U+0301 "combining acute accent".

This is independent of the Unicode encoding, which turns those codepoints into bytes. For example, in UTF-8 the two forms encode to C3A9 and 65CC81 respectively.
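A quick way to see the two byte sequences for yourself, as a minimal Python sketch (the variable names are just for illustration):

```python
# Two ways of writing "é": one precomposed codepoint, or "e" plus a combining accent.
composed = "\u00e9"          # U+00E9
decomposed = "\u0065\u0301"  # U+0065 followed by U+0301

# Same rendered character, different codepoint sequences.
strings_equal = composed == decomposed

# Encoding to UTF-8 gives different bytes for each form.
composed_hex = composed.encode("utf-8").hex()
decomposed_hex = decomposed.encode("utf-8").hex()

print(strings_equal, composed_hex, decomposed_hex)
```

The two strings look identical on screen but compare as unequal, and their UTF-8 bytes differ.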

Users don't really have control over which form their keyboard/application puts in the text field when they press a key, and obviously the hashes of the two forms are different, so the password wouldn't match. Normalization is the process of converting the text to either its composed form (NFC, in my example "\u00E9") or its decomposed form (NFD, "\u0065\u0301"), so you can then compare your codepoints/bytes/hashes.
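Roughly what that looks like in Python, using the standard `unicodedata` module. This is only a sketch: `password_digest` is a made-up helper, and plain SHA-256 is used just to show the mismatch. A real system would use a proper password hashing function.

```python
import hashlib
import unicodedata


def password_digest(password: str, form: str = "NFC") -> str:
    """Hypothetical helper: normalize first, then hash the UTF-8 bytes.

    SHA-256 is for illustration only, not a recommendation for storing passwords.
    """
    normalized = unicodedata.normalize(form, password)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


composed = "\u00e9"          # single codepoint
decomposed = "\u0065\u0301"  # base letter + combining accent

# Without normalization the hashes differ.
raw_match = (
    hashlib.sha256(composed.encode("utf-8")).hexdigest()
    == hashlib.sha256(decomposed.encode("utf-8")).hexdigest()
)

# With normalization both inputs hash to the same value.
normalized_match = password_digest(composed) == password_digest(decomposed)

print(raw_match, normalized_match)
```

Whichever form you pick, the point is to apply the same one both when the password is set and when it is checked.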

https://en.wikipedia.org/wiki/Unicode_equivalence#Normalizat...