| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by account42 1232 days ago
	AFAIK, overlong encodings are not valid UTF-8 [0] which means the canonical encoding of each unicode code point is clearly defined as there is only one valid encoding. That still leaves higher level Unicode shenanigans, but every Unicode encoding is going to have those. And of course, what applications accept in practice is an entirely different issue. [0] https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings

1 comments

tracker1 1232 days ago

While the point on overlong encodings is true... there are multiple ways to do composition[1]. brown-skin + high-five or high-five + brown skin... etc.

Part of why it's a good idea to normalize input before password hashing, as an example... It will likely become more common over time to use emoji as passphrase input.

1. https://en.wikipedia.org/wiki/Unicode_equivalence

link

account42 1232 days ago

Yes, composing characters are what I was referencing with "higher level Unicode shenanigans". This doesn't stop there though - many people would say that "а" and "a" are encodings of the same character even if Unicode thinks otherwise. All that is above the concerns of UTF-8 though, which only cares about encoding code points into byte sequences.

link