by eatonphil
1063 days ago
> Consider a corrupted codepoint at the end of a user generated string: will it recognize the closing quote as such, or will it assume it is part of a corrupted codepoint and try to skip over it?

Maybe I'm misunderstanding you, but because of how UTF-8 is a superset of ASCII, I don't believe you can misrecognize ASCII characters if that's what you mean.

- UTF-8 is a prefix-free self-synchronizing code;
- If the first byte of an encoded codepoint matches 0b0???????, it is ASCII, and all is well;
- If the leading byte matches 0b110?????, there is one continuation byte to follow. If it's 0b1110????, there are two, and if it's 0b11110???, there are three. Three continuation bytes is the limit, so a UTF-8 sequence is at most 4 bytes in total;
- All continuation bytes match 0b10??????, and UTF-8 self-synchronizes by detecting the leading byte;
- The correct way to parse UTF-8 is to not believe these lengths AT ALL and actually run the UTF-8 state machine over the entire input, which can be made quite fast by leveraging bit-parallel techniques (see Daniel Lemire's work);
- The way you shoot yourself in the foot is by believing the length and skipping over those bytes: an attacker makes the last codepoint of a string one whose leading byte expects a single continuation byte, but leaves the continuation byte out. The fancy pantsy "optimized" parser then skips over the closing quote and decoheres the parse. Skipping is only safe on pre-validated input, and even then it's kind of not worth it if you have access to a SIMD-accelerated UTF-8 validator.
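
To make the last two points concrete, here is a rough Python sketch. The function names and the payload are made up for illustration, it ignores backslash escapes, and the "checked" scanner only verifies continuation bytes (it is not a full validator: it does not reject every overlong or surrogate encoding), so treat it as a demonstration of the failure mode rather than a real parser.

```python
def expected_length(lead: int) -> int:
    """Total sequence length implied by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1  # 0b0???????: ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 0b110????? (0xC0 and 0xC1 could only encode overlong forms)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 0b1110????
    if 0xF0 <= lead <= 0xF4:
        return 4  # 0b11110??? (anything above 0xF4 is past U+10FFFF)
    return 0  # continuation byte or invalid lead byte


def find_closing_quote_naive(buf: bytes, start: int) -> int:
    """Believes the lead byte and jumps ahead. Unsafe on unvalidated input."""
    i = start
    while i < len(buf):
        if buf[i] == ord('"'):
            return i
        i += max(expected_length(buf[i]), 1)  # skips bytes without looking at them
    return -1


def find_closing_quote_checked(buf: bytes, start: int) -> int:
    """Looks at every byte and refuses sequences with bad continuation bytes."""
    i = start
    while i < len(buf):
        lead = buf[i]
        if lead == ord('"'):
            return i
        n = expected_length(lead)
        if n == 0:
            raise ValueError(f"invalid lead byte at offset {i}")
        for j in range(i + 1, i + n):
            # Every continuation byte must match 0b10??????.
            if j >= len(buf) or (buf[j] & 0xC0) != 0x80:
                raise ValueError(f"truncated sequence at offset {i}")
        i += n
    return -1


# The attacker ends the first string with 0xC3, a lead byte that promises
# one continuation byte, and then does not supply it.
payload = b'"abc\xc3", "x"'

print(find_closing_quote_naive(payload, 1))  # 8: the real closing quote at offset 5 was skipped
try:
    find_closing_quote_checked(payload, 1)
except ValueError as err:
    print(err)  # truncated sequence at offset 4
```

The naive scanner ends up treating the opening quote of the next string as the end of the first one, which is exactly the desync described above.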
Hope this clears it up!
PS: I DMed you on Discord ;)