Hacker News new | ask | show | jobs
by luizfelberti 1062 days ago
You are correct that this is a detectable and entirely preventable failure, however, this is the way in which this can manifest:

- UTF-8 is a prefix-free self-synchronizing code;

- If the first byte of a UTF-8 codepoint starts with 0b0??????? then it is ASCII, and all is well;

- If the leading byte of the codepoint is 0b110? it means there is one continuation byte to follow. If its 0b1110? there are two bytes to follow, and so on up to a maximum of 4 continuation bytes, which is the limit for UTF-8;

- All continuation bytes have the pattern 0b10? and UTF-8 self synchronizes based on detecting the leading byte;

- The correct way to parse UTF-8 is to not believe these lengths AT ALL and actually run the UTF-8 state-machine over the entire input, which can be made quite fast by leveraging bit-parallel techniques (see Daniel Lemire's work);

- The way you shoot yourself in the foot is by believing the length and skipping over those bytes: an attacker makes the last codepoint one that expects a single continuation byte but does not include the continuation byte, the fancy pantsy "optimized" parser will skip over the closing quote and decohere the parse. This is only safe to do on pre-validated input, but even then it's kind of not worth it if you have access to a SIMD accelerated UTF-8 validator

Hope this clears it up!

PS: I DMed you on Discord ;)