Hacker News
by ciprian_craciun 1232 days ago
UTF-8 is both "binary" and "text" at the same time. :)

It's binary because the UTF-8 standard states how each Unicode code point (loosely speaking, a character) is to be translated into a sequence of bytes.

But, because each (correct) UTF-8 byte sequence can be translated back into a sequence of Unicode code points, you can also see it as text. :)
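To make the "both at once" point concrete, here is a small sketch of the round trip in Python. The sample string is just something picked for illustration; the only calls used are the built-in `str.encode` and `bytes.decode`.

```python
# Round trip: code points -> UTF-8 bytes -> code points.
# The sample text is arbitrary; escapes are used so the code points are unambiguous.
text = "h\u00e9llo \u20ac"  # "héllo €": U+00E9 and U+20AC are non-ASCII

encoded = text.encode("utf-8")     # the "binary" view: a bytes object
decoded = encoded.decode("utf-8")  # the "text" view: back to a str

# Number of bytes each code point occupies in UTF-8.
widths = {ch: len(ch.encode("utf-8")) for ch in text}

print(encoded)
print(decoded == text)
print(widths)
```

ASCII code points take one byte each, while U+00E9 takes two and U+20AC takes three, so the byte length and the code point count of the same string differ.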

(BTW, as far as I know, UTF-8 doesn't specify a canonical encoding of Unicode text, so for cryptographic purposes, especially signatures, one should perhaps treat it with care.)


AFAIK, overlong encodings are not valid UTF-8 [0], which means the canonical encoding of each Unicode code point is clearly defined: there is only one valid encoding. That still leaves higher-level Unicode shenanigans, but every Unicode encoding is going to have those. And of course, what applications accept in practice is an entirely different issue.

[0] https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings
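As a quick check of the overlong point, here is a sketch using Python's built-in strict UTF-8 decoder. The helper name `is_valid_utf8` is made up for this example, and other decoders may be more lenient, which is the "what applications accept in practice" caveat.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


SHORTEST = b"\x2f"            # "/" (U+002F) in its only valid form
OVERLONG_2 = b"\xc0\xaf"      # the same code point padded out to two bytes
OVERLONG_3 = b"\xe0\x80\xaf"  # the same code point padded out to three bytes

for candidate in (SHORTEST, OVERLONG_2, OVERLONG_3):
    print(candidate, is_valid_utf8(candidate))
```

Only the one-byte form decodes; the padded forms raise `UnicodeDecodeError`, so at the byte level each code point has exactly one accepted spelling here.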

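While the point on overlong encodings is true, there are multiple ways to do composition[1]: "é" can be the single code point U+00E9 or "e" followed by the combining acute accent U+0301, and several combining marks on one base can be written in different orders.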

This is part of why it's a good idea to normalize input before password hashing, for example. It will likely become more common over time to use emoji as passphrase input.

1. https://en.wikipedia.org/wiki/Unicode_equivalence
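Here is a rough sketch of what normalizing before hashing could look like in Python. The function name `hash_passphrase`, the choice of NFC, the fixed salt, and the iteration count are all assumptions made for the demo, not a recommendation for production settings.

```python
import hashlib
import unicodedata


def hash_passphrase(passphrase: str, salt: bytes, normalize: bool = True) -> bytes:
    """Derive a key from a passphrase, optionally normalizing to NFC first.

    Hypothetical helper: the NFC form and the PBKDF2 parameters are
    illustrative assumptions, not tuned values.
    """
    if normalize:
        passphrase = unicodedata.normalize("NFC", passphrase)
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode("utf-8"), salt, 10_000)


precomposed = "caf\u00e9"  # "é" as one code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent
salt = b"fixed-demo-salt"  # demo only; use a random per-user salt in practice

# Without normalization the two spellings produce different hashes.
print(hash_passphrase(precomposed, salt, normalize=False)
      == hash_passphrase(decomposed, salt, normalize=False))

# With normalization they collapse to the same bytes and the same hash.
print(hash_passphrase(precomposed, salt) == hash_passphrase(decomposed, salt))
```

The two strings render identically but encode to different bytes, so a user typing the same passphrase on two keyboards could otherwise be locked out.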

Yes, combining characters are what I was referencing with "higher level Unicode shenanigans". It doesn't stop there though: many people would say that "а" (Cyrillic) and "a" (Latin) are encodings of the same character even if Unicode thinks otherwise. All of that is above the concerns of UTF-8 though, which only cares about encoding code points into byte sequences.
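For the lookalike case, a small sketch with Python's `unicodedata` module shows that the two letters are distinct code points, encode to different bytes, and are not unified by normalization. The variable names are just for this example.

```python
import unicodedata

cyrillic_a = "\u0430"  # looks like "a" in most fonts
latin_a = "\u0061"     # plain ASCII "a"

# Unicode treats these as different characters with different names.
names = (unicodedata.name(cyrillic_a), unicodedata.name(latin_a))

# UTF-8 just encodes whichever code point it is given.
encodings = (cyrillic_a.encode("utf-8"), latin_a.encode("utf-8"))

# Even compatibility normalization (NFKC) does not merge the two.
same_after_nfkc = (
    unicodedata.normalize("NFKC", cyrillic_a)
    == unicodedata.normalize("NFKC", latin_a)
)

print(names)
print(encodings)
print(same_after_nfkc)
```

So catching this kind of confusion needs something beyond normalization, and certainly beyond UTF-8 itself.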