by eatonphil 1066 days ago
Ok, one reason I can think of why you'd want to be UTF-8 aware is so that your error messages at any part of the parser could point to the exact column in the line of text. The line number you could get without being UTF-8 aware. But the column number you couldn't get without being UTF-8 aware.
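
A minimal Python sketch of that distinction. The function name `line_and_column` and the sample input are made up for illustration, and it assumes lines end in \n, the column is counted in code points, and the offset falls on a code point boundary:

```python
def line_and_column(source: bytes, offset: int) -> tuple:
    """Return a 1-based (line, column) pair for a byte offset into UTF-8 source."""
    before = source[:offset]
    # The line number only needs the raw bytes: count LF bytes before the offset.
    line = before.count(b"\n") + 1
    # rfind returns -1 when there is no earlier newline, so this becomes 0.
    line_start = before.rfind(b"\n") + 1
    # The column needs decoding: count code points since the line start, not bytes.
    column = len(before[line_start:].decode("utf-8")) + 1
    return line, column


source = "x = 'é'\ny = '日本' + z".encode("utf-8")
offset = source.index(b"z")
print(line_and_column(source, offset))  # (2, 12)
```

A byte-based column for the same offset would be 16, because each of the two CJK characters takes three bytes.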

> The line number you could get without being UTF-8 aware.

Can you? Unicode has the following "new line" characters:

* U+000A Line Feed (LF) alone

* U+000D Carriage Return (CR) alone

* CRLF as one indivisible sequence

* U+000B Line Tabulation (VT) — supporting this is explicitly optional, and the main standard's newline function definition does not include it

* U+000C Form Feed (FF)

* U+0085 Next Line (NEL), an EBCDIC round-trip compatibility character

* U+2028 Line Separator (LS)

* U+2029 Paragraph Separator (PS)

My source: https://langdev.stackexchange.com/a/590/717
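
To show how much the answer depends on the definition, here is a rough Python sketch that counts lines two ways. The names `UNICODE_NEWLINE`, `count_lines_unicode`, and `count_lines_ascii` are hypothetical, and the character set is simply the list above (with VT included, even though it is optional):

```python
import re

# Newline sequences from the list above. CRLF comes first in the alternation
# so that it matches as one indivisible sequence rather than as CR then LF.
UNICODE_NEWLINE = re.compile("\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")


def count_lines_unicode(text: str) -> int:
    """Count lines, treating every newline sequence in the list as a terminator."""
    return len(UNICODE_NEWLINE.findall(text)) + 1


def count_lines_ascii(text: str) -> int:
    """Count lines, treating only LF as a terminator (CRLF still contains one LF)."""
    return text.count("\n") + 1


sample = "a\r\nb\u2028c\x85d\ne"
print(count_lines_unicode(sample))  # 5
print(count_lines_ascii(sample))    # 3
```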

Yes, whitespace in Unicode is expansive. However, you could (and I assume most languages do) specify that a newline is \n or \r\n, both of which are expressible in ASCII.

Maybe I'm wrong though, just an assumption about what's common.

(See for example how Go, which is Unicode aware, defines tokens: https://go.dev/ref/spec#Tokens.)

There are also other concerns depending on your threat model: if you're parsing user-generated strings, you definitely want to be able to handle corrupted Unicode for security reasons, and the way you recover from it (if you choose to recover at all) can make exploitation easier.

Consider a corrupted codepoint at the end of a user generated string: will it recognize the closing quote as such, or will it assume it is part of a corrupted codepoint and try to skip over it?

There are so many ways to shoot yourself in the foot by "abstracting away" the formal semantics of your inputs that I think it's pretty much never worth it. (An interesting search term here is LangSec.)

> Consider a corrupted codepoint at the end of a user generated string: will it recognize the closing quote as such, or will it assume it is part of a corrupted codepoint and try to skip over it?

Maybe I'm misunderstanding you, but because UTF-8 is a superset of ASCII, and every byte of a multi-byte sequence has its high bit set, I don't believe you can misrecognize ASCII characters, if that's what you mean.
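
One way to sanity-check this for valid UTF-8 is a small Python sketch (the helper name `multibyte_bytes_are_non_ascii` is made up here). It only says something about well-formed input, not about corrupted bytes:

```python
def multibyte_bytes_are_non_ascii(text: str) -> bool:
    """Check that no byte of a multi-byte encoded code point is below 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


# The ASCII quote byte 0x22 only ever shows up where a real '"' is.
print(multibyte_bytes_are_non_ascii('"héllo 日本 🙂"'))  # True
```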

You are correct that this is a detectable and entirely preventable failure. However, here is how it can manifest:

- UTF-8 is a prefix-free self-synchronizing code;

- If the first byte of an encoded code point matches 0b0???????, it is ASCII, and all is well;

- If the leading byte matches 0b110?????, one continuation byte follows. If it's 0b1110????, two follow, and if it's 0b11110???, three follow, which is the limit for UTF-8 (four bytes per code point in total);

- All continuation bytes match 0b10??????, and UTF-8 self-synchronizes because leading bytes can always be told apart from continuation bytes;

- The correct way to parse UTF-8 is to not believe these lengths AT ALL and actually run the UTF-8 state machine over the entire input, which can be made quite fast by leveraging bit-parallel techniques (see Daniel Lemire's work);

- The way you shoot yourself in the foot is by believing the length and skipping over those bytes: an attacker ends the string with a leading byte that expects a single continuation byte but leaves that byte out, so the fancy-pantsy "optimized" parser skips over the closing quote and the parse decoheres. Skipping like this is only safe on pre-validated input, and even then it's kind of not worth it if you have access to a SIMD-accelerated UTF-8 validator.
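
The failure in the last bullet can be sketched in a few lines of Python. All function names here are hypothetical, and the "checked" version only verifies continuation bytes before skipping them; it is not a full UTF-8 validator (it does not reject overlong encodings, surrogates, or stray continuation bytes):

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a leading byte (1 for ASCII or anything else)."""
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1


def find_closing_quote_naive(data: bytes, start: int) -> int:
    """Trusts the length in each leading byte. Unsafe on unvalidated input."""
    i = start
    while i < len(data):
        if data[i] == 0x22:  # ASCII double quote
            return i
        i += expected_length(data[i])
    return -1


def find_closing_quote_checked(data: bytes, start: int) -> int:
    """Confirms each continuation byte matches 0b10?????? before skipping it."""
    i = start
    while i < len(data):
        if data[i] == 0x22:
            return i
        n = expected_length(data[i])
        for k in range(1, n):
            if i + k >= len(data) or data[i + k] >> 6 != 0b10:
                raise ValueError(f"truncated UTF-8 sequence at byte {i}")
        i += n
    return -1


# 0xC3 promises one continuation byte, but the closing quote follows instead.
malicious = b'"abc\xc3" tail'
print(find_closing_quote_naive(malicious, 1))  # -1: the closing quote was skipped
```

On the same input `find_closing_quote_checked` raises `ValueError`, and Python's own strict decoder (`malicious.decode("utf-8")`) raises `UnicodeDecodeError`. On well-formed input such as `b'"caf\xc3\xa9"'`, both functions agree on the position of the closing quote.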

Hope this clears it up!

PS: I DMed you on Discord ;)