

	by eatonphil 1064 days ago
	Can't parsers pretty easily handle UTF-8 if you just consider identifiers (and strings) as bags of bytes?


luizfelberti 1064 days ago

Depends on where you draw the line of what a parser is:

- If the parser is "the thing that comes after the lexer" then all of this is abstracted away by the lexer and you can just treat it as a span of bytes;

- If the parser is "everything that needs to be implemented to correctly transduce the input sequence into a tree", then you need to implement this yourself or have a lexer that handles this for you, usually done by having a tiny UTF-8 codepoint recognizing FSM in your lexer (UTF-8 is a self-synchronizing code, which makes this part easier) and ignoring the existence of graphemes.

Most people, however, shy away from implementing a parser "all the way down to the bytes" and properly handling UTF-8 as a formal language. Most lean on a lexer abstracting this away. Ditto for context-sensitivity.

Recently Rust's regex engine underwent a major overhaul, and burntsushi wrote a blog post[0] about doing the "all the way to the bytes" thing in the new regex engine. I highly recommend the read:

[0] https://blog.burntsushi.net/regex-internals/#nfa-optimizatio...


eatonphil 1063 days ago

Yeah I'm saying why does your lexer actually need to be UTF-8 aware? (An actual question, because maybe I'm not thinking of some obvious case.)

Most of the lexical/syntactic elements of languages are plain ASCII, not multi-byte UTF-8. You're looking for things like semicolons and quotes and whitespace. If you define the language's syntax/lexical elements so that they stay within the ASCII subset of UTF-8, then why does your lexer need to be aware of UTF-8? It can just accumulate everything else as bytes, and it doesn't matter what encoding those bytes are in. The parser and/or codegen will do equality checks for lookups later on, but that doesn't need to be UTF-8 aware either?
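A minimal sketch of the "bag of bytes" lexer described here, assuming a made-up toy language whose punctuation, quotes, and whitespace are all ASCII. The names `PUNCT`, `WS`, and `lex` are hypothetical and only illustrate the idea; every byte >= 0x80 is accumulated into the current token without being decoded.

```python
# Hypothetical toy lexer: only ASCII bytes drive tokenization.
PUNCT = b"(){};,=+-*/"
WS = b" \t\r\n"
QUOTE = ord('"')


def lex(src: bytes) -> list[tuple[str, bytes]]:
    tokens = []
    i, n = 0, len(src)
    while i < n:
        b = src[i]
        if b in WS:
            i += 1
        elif b in PUNCT:
            tokens.append(("punct", src[i:i + 1]))
            i += 1
        elif b == QUOTE:
            # Strings end at the next ASCII quote; contents are opaque bytes.
            end = src.find(b'"', i + 1)
            if end == -1:
                raise ValueError(f"unterminated string at byte {i}")
            tokens.append(("string", src[i:end + 1]))
            i = end + 1
        else:
            # Anything else, including bytes >= 0x80, is part of an identifier.
            start = i
            while i < n and src[i] not in WS and src[i] not in PUNCT and src[i] != QUOTE:
                i += 1
            tokens.append(("ident", src[start:i]))
    return tokens
```

Calling `lex('naïve = "héllo";'.encode())` yields an identifier, `=`, a string, and `;` without any decoding step. Invalid bytes such as `0xFF` pass straight through into identifiers, which is the "garbage in, garbage out" trade-off.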

Am I missing something?


dezgeg 1063 days ago

If you don't want to error/warn on invalid UTF-8 but instead handle it with the "garbage in, garbage out" principle, then yes, you're right: treating them as pure byte streams works.


eatonphil 1063 days ago

Yeah that makes sense. It doesn't really strike me as the job of the compiler/parser to validate UTF-8. If you've got a messed up text editor/OS environment that's going to be a problem for lots of things.


luizfelberti 1063 days ago

What dezgeg said is pretty much spot on, and also I think what you're describing related to "compiling the codepoints down to bytes" is in many ways equivalent to handling the UTF-8.

My opinion, stated in a way that a TigerBeetler will resonate with ;), is that I want to be able to handle radioactive levels of corruption in my inputs and still parse them without blowing up, issuing great error messages along the way.


eatonphil 1063 days ago

Ok, one reason I can think of why you'd want to be UTF-8 aware is so that your error messages at any part of the parser could point to the exact column in the line of text. The line number you could get without being UTF-8 aware. But the column number you couldn't get without being UTF-8 aware.
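As a rough illustration of the column problem, here is a sketch that turns a byte offset into a column by counting only the bytes that start a code point. The helper name `column_of` is hypothetical, and it assumes the line is valid UTF-8; it counts code points, not graphemes or display width.

```python
def column_of(line: bytes, byte_offset: int) -> int:
    """1-based column, in code points, for a byte offset into a UTF-8 line."""
    prefix = line[:byte_offset]
    # Continuation bytes look like 0b10xxxxxx; every other byte starts a code point.
    return 1 + sum(1 for b in prefix if (b & 0xC0) != 0x80)
```

For the line `héllo = ;` the semicolon sits at byte offset 9, so a byte-counting lexer would report column 10, while `column_of` reports column 9.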


TRiG_Ireland 1063 days ago

> The line number you could get without being UTF-8 aware.

Can you? Unicode has the following "new line" characters:

* U+000A Line Feed (LF) alone

* U+000D Carriage Return (CR) alone

* CRLF as one indivisible sequence

* U+000B Line Tabulation (VT) — supporting this is explicitly optional, and the main standard's newline function definition does not include it

* U+000C Form Feed (FF)

* U+0085 Next Line (NEL), an EBCDIC round-trip compatibility character

* U+2028 Line Separator (LS)

* U+2029 Paragraph Separator (PS)

My source: https://langdev.stackexchange.com/a/590/717
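To make the difference concrete, here is a small sketch comparing a line count that only knows `\n` and `\r\n` with one that recognizes every terminator in the list above. The pattern names and `count_lines` are hypothetical, and including VT is a choice here, since the list marks it as optional.

```python
import re

# Every terminator from the list above; CRLF comes first so it matches as one unit.
UNICODE_NEWLINE = re.compile("\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")
# What an ASCII-only lexer would typically look for.
ASCII_NEWLINE = re.compile("\r\n|\n")


def count_lines(text: str, newline: re.Pattern) -> int:
    # N terminators separate N + 1 lines (the last one possibly empty).
    return len(newline.findall(text)) + 1
```

On the text `"a\r\nb\u2028c\x85d\ne"` the ASCII pattern sees 3 lines and the Unicode pattern sees 5. In UTF-8, NEL is the two bytes `C2 85` and LS is `E2 80 A8`, so a byte-oriented lexer only notices them if it matches those multi-byte sequences explicitly.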


eatonphil 1063 days ago

Yes whitespace in unicode is expansive. However, you could (and I assume most languages do) specify that a newline is \n or \r\n which are expressible in ASCII.

Maybe I'm wrong though, just an assumption about what's common.

(See for example how Go, which is Unicode aware, defines tokens: https://go.dev/ref/spec#Tokens.)


luizfelberti 1063 days ago

There are also other concerns depending on your threat model: if you're parsing user-generated strings you definitely want to be able to handle corrupted unicode, for security reasons, and in these scenarios the way you handle recovery if you choose to do so may aggravate exploitation.

Consider a corrupted codepoint at the end of a user-generated string: will your parser recognize the closing quote as such, or will it assume the quote is part of the corrupted codepoint and try to skip over it?

So many ways to shoot yourself in the foot by "abstracting away" the formal semantics of your inputs, I think it's pretty much never worth it. (An interesting search term here is LangSec)


eatonphil 1063 days ago

> Consider a corrupted codepoint at the end of a user generated string: will it recognize the closing quote as such, or will it assume it is part of a corrupted codepoint and try to skip over it?

Maybe I'm misunderstanding you, but because of how UTF-8 is a superset of ASCII, I don't believe you can misrecognize ASCII characters if that's what you mean.
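The property being relied on can be checked by brute force: in well-formed UTF-8, no byte of a multi-byte sequence falls in the ASCII range. The function name below is made up for illustration, and the check says nothing about malformed input.

```python
def multibyte_sequences_avoid_ascii() -> bool:
    """True if no non-ASCII code point encodes to a byte below 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True
```

So scanning valid UTF-8 for an ASCII byte such as `"` or `;` can never land in the middle of another character.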


luizfelberti 1063 days ago

You are correct that this is a detectable and entirely preventable failure, however, this is the way in which this can manifest:

- UTF-8 is a prefix-free self-synchronizing code;

- If the first byte of a UTF-8 codepoint starts with 0b0??????? then it is ASCII, and all is well;

- If the leading byte of the codepoint is 0b110????? it means there is one continuation byte to follow. If it's 0b1110???? there are two to follow, and 0b11110??? means three, which is the maximum: a UTF-8 codepoint is at most 4 bytes long (the leading byte plus 3 continuation bytes);

- All continuation bytes have the pattern 0b10??????, and UTF-8 self-synchronizes based on detecting the leading byte;

- The correct way to parse UTF-8 is to not believe these lengths AT ALL and actually run the UTF-8 state-machine over the entire input, which can be made quite fast by leveraging bit-parallel techniques (see Daniel Lemire's work);

- The way you shoot yourself in the foot is by believing the length and skipping over those bytes: an attacker makes the last codepoint one that expects a single continuation byte but leaves the continuation byte out, so the fancy-pantsy "optimized" parser skips over the closing quote and the parse decoheres. Skipping like this is only safe on pre-validated input, and even then it's kind of not worth it if you have access to a SIMD-accelerated UTF-8 validator.
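The failure mode in the list above can be reproduced with two hypothetical scanners that look for a closing quote: one trusts the length implied by the leading byte, the other checks every continuation byte. This is a sketch only. The checked version verifies structure but not overlong encodings, surrogates, or values above U+10FFFF, which a real validator also needs to reject.

```python
QUOTE = ord('"')


def seq_len(b: int) -> int:
    """Total sequence length implied by a leading byte, or 0 if b cannot lead."""
    if b < 0x80:
        return 1
    if b >> 5 == 0b110:
        return 2
    if b >> 4 == 0b1110:
        return 3
    if b >> 3 == 0b11110:
        return 4
    return 0  # stray continuation byte or invalid leading byte


def naive_string_end(buf: bytes, start: int) -> int:
    """UNSAFE: skips by the implied length without looking at the skipped bytes."""
    i = start
    while i < len(buf):
        if buf[i] == QUOTE:
            return i
        i += seq_len(buf[i]) or 1
    return -1


def checked_string_end(buf: bytes, start: int) -> int:
    """Verifies each continuation byte; raises instead of skipping blindly."""
    i = start
    while i < len(buf):
        if buf[i] == QUOTE:
            return i
        length = seq_len(buf[i])
        if length == 0:
            raise ValueError(f"unexpected byte 0x{buf[i]:02x} at offset {i}")
        for j in range(i + 1, i + length):
            if j >= len(buf) or (buf[j] & 0xC0) != 0x80:
                raise ValueError(f"truncated sequence starting at offset {i}")
        i += length
    return -1
```

With `buf = b'x = "ab\xc3"; y = "z";'` and the scan starting at offset 5, the real closing quote is at offset 8. The naive scanner treats that quote as the continuation byte of `0xC3` and returns 15, the opening quote of the next string, while the checked scanner raises `ValueError`.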

Hope this clears it up!

PS: I DMed you on Discord ;)


duped 1063 days ago

Do you treat non-ascii whitespace as whitespace or valid parts of a lexeme?


classified 1063 days ago

… or use `fgetwc()`.


spc476 1063 days ago

C programs start out in the "C" locale, so just using fgetwc() won't work out of the box (or won't do what you expect it to do). You'll need to call setlocale(LC_ALL, "") (or at least setlocale(LC_CTYPE, "")) first to pick up the environment's locale and get the expected behavior.
