Hacker News
by eps 4949 days ago
Which tools?

Honest question, as the three-byte limit seems rather arbitrary and no more logical than, say, a four-byte one.

2 comments

It is totally arbitrary - there's no reason you can't have degenerate 6-byte encodings, and compliant decoders should cope with them. See Markus Kuhn's excellent UTF-8 decoder torture test page linked elsewhere in this thread.

Actually, a truly compliant UTF-8 decoder should reject all degenerate forms, because they can be used to bypass validation checks, and the spec does not allow such forms. A four-byte UTF-8 sequence is sufficient to represent any Unicode code point (and then some), so no compliant decoder should accept any sequences of 5 or 6 bytes.

Note that Kuhn's torture test page deliberately includes a lot of invalid sequences in order to make sure that they're gracefully handled. Section 4 is dedicated to degenerate encodings like this.
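
To make the "reject degenerate forms" point concrete, here is a minimal sketch of a strict decoder in Python. It is an illustration rather than a reference implementation: the names `decode_utf8_strict` and `is_strict_utf8` are made up for this example, and it checks only the rules mentioned above (no overlong forms, no 5- or 6-byte sequences, nothing above U+10FFFF) plus the ban on surrogates.

```python
# Hypothetical helper names; a sketch of strict UTF-8 decoding, not a library API.

# (lead-byte mask, lead-byte pattern, continuation bytes, smallest legal code point)
_FORMS = [
    (0x80, 0x00, 0, 0x0000),    # 0xxxxxxx
    (0xE0, 0xC0, 1, 0x0080),    # 110xxxxx 10xxxxxx
    (0xF0, 0xE0, 2, 0x0800),    # 1110xxxx 10xxxxxx 10xxxxxx
    (0xF8, 0xF0, 3, 0x10000),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
]


def decode_utf8_strict(data: bytes) -> list[int]:
    """Return the code points in data, raising ValueError on any invalid form."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        for mask, pattern, extra, minimum in _FORMS:
            if lead & mask == pattern:
                break
        else:
            # Covers stray continuation bytes and 5/6-byte lead bytes (0xF8 and up).
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")

        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra:
            raise ValueError(f"truncated sequence at offset {i}")

        cp = lead & (~mask & 0xFF)  # payload bits of the lead byte
        for b in tail:
            if b & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte 0x{b:02X}")
            cp = (cp << 6) | (b & 0x3F)

        if cp < minimum:
            raise ValueError(f"overlong encoding of U+{cp:04X} at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate U+{cp:04X} at offset {i}")
        if cp > 0x10FFFF:
            raise ValueError(f"code point 0x{cp:X} is beyond U+10FFFF")

        out.append(cp)
        i += 1 + extra
    return out


def is_strict_utf8(data: bytes) -> bool:
    """True if data decodes cleanly under the checks above."""
    try:
        decode_utf8_strict(data)
        return True
    except ValueError:
        return False
```

With this sketch, `is_strict_utf8(b"\xC0\xAF")` is false: those two bytes are an overlong encoding of "/", which is exactly the kind of input that slips past a validation check done before decoding.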

It's not 3 bytes so much as 16 bits, i.e. the Unicode 1.0 limit, which turns into 3 bytes in UTF-8.
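
The arithmetic behind that can be sketched in a few lines of Python. A 3-byte sequence has 4 + 6 + 6 = 16 payload bits, so it covers exactly the 16-bit range; anything from U+10000 up needs a fourth byte. The function name `utf8_length` is just for illustration.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for a code point (shortest form only)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:        # 7 payload bits
        return 1
    if code_point < 0x800:       # 5 + 6 = 11 payload bits
        return 2
    if code_point < 0x10000:     # 4 + 6 + 6 = 16 payload bits
        return 3
    if code_point <= 0x10FFFF:   # 3 + 6 + 6 + 6 = 21 payload bits, capped by Unicode
        return 4
    raise ValueError("beyond U+10FFFF, not a Unicode code point")


# The largest 16-bit value still fits in 3 bytes; the next code point does not.
assert utf8_length(0xFFFF) == 3
assert utf8_length(0x10000) == 4
```

So a tool that stops at 3 bytes is most likely carrying a 16-bit assumption around rather than picking a byte count out of thin air.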