Hacker News new | ask | show | jobs
by luizfelberti 1063 days ago
Cute, now do it with UTF-8 support.

> People are terrified of parsers and parsing

And rightfully so. People who aren't afraid of them generally fail to understand all of the ways in which parsing can show fractal complexity, and will mostly stick to toy examples like this INI parser to justify their positions.

If you're gonna argue that parsing is simple, the bare minimum I'd want to see implemented is a context-sensitive grammar with unbounded lookaheads (or at the very least, that is capable of handling more than one token of lookahead), with proper support for Unicode, and actual error resilience (not what this article calls error resilience)

If you manage to do all that and can still call what you did "simple" without having completely deluded yourself, congratulations, I hope to be on your level some day.

PS1: I won't even go into the plethora of security issues originating from crappy parsers, especially those written in C

PS2: Let's also leave aside any matters related to correctness and validation of parsers, which are notoriously not by any means "simple".

PS3: Or generating decent errors for that matter.

6 comments

UTF-8 was designed so that you don't have to worry about it. Supporting UTF-8 in a parser is trivial, basically just parse as if it were ASCII but don't barf on the bytes >= 128.

As long as all your delimiter chars are ASCII, it just works.

Errors in C are usually because of missing abstractions or the wrong approach. C gives you data layout, flow control, and functions, you can go a long long way with just that.

> unbounded lookaheads

If you want to require that, you get what you deserve. But implementing it is just a matter of putting a queue of tokens in front of your parser that supports look(n) separately from consume().

> Cute, now do it with UTF-8 support.

I see this type of sentiment a lot and I'm not sure why this exists. Maybe it's because there were a bunch of formats in the past and it made it more difficult? Idk.

Anyways, I finally decided to "bite the bullet" and prepared a solid week to finally do the "nitty gritty" of writing a UTF-8 validator/logging library. Turns out, it was super easy and took me like an hour to read through the RFC and maybe 2 more hours to write a simple implementation.

For anyone that's curious, give it a read here[0], it's surprisingly readable and the format is very simple and elegant. I don't say simple as in dumb either, I say simple as in they made the problem as simple as it needs to be with no unneeded complexity, and it's a breath of fresh air.

Also, it's written in such a way that any valid ASCII is valid UTF-8. So at the very least, you can just check if you encounter any bytes with the highest bit set in the string before parsing. If that's the case you can throw an error saying you don't support UTF-8 and avoid parsing potentially invalid data (not that it's particularly difficult to validate the UTF-8 if you want to).

[0]: https://datatracker.ietf.org/doc/html/rfc3629#section-3

Have you validated that simple implementation with anything besides English?
Yes.

Let me know if you see any bugs[0], I'll add it to my regression tests.

[0]: https://github.com/ambrosiogabe/CppUtils/blob/master/single_...

Fair enough.

I was thinking about the usual cases where lowercase and uppercase character count don't match, non-latin character based languages and so on.

Can't parsers pretty easily handle UTF-8 if you just consider identifiers (and strings) as bags of bytes?
Depends on where you draw the line of what a parser is:

- If the parser is "the thing that comes after the lexer" then all of this is abstracted away by the lexer and you can just treat it as a span of bytes;

- If the parser is "everything that needs to be implemented to correctly transduce the input sequence into a tree", then you need to implement this yourself or have a lexer that handles this for you, usually done by having a tiny UTF-8 codepoint recognizing FSM in your lexer (UTF-8 is a self-synchronizing code, which makes this part easier) and ignoring the existence of graphemes.

Most people, however, shy away from implementing a parser "all the way down to the bytes" and properly handling UTF-8 as a formal language. Most lean on a lexer abstracting this away. Ditto for context-sensitivity.

Recently Rust's regex engine underwent a major overhaul, and burntsushi wrote a blog post[0] about doing the "all the way to the bytes" thing in the new regex engine, I highly recommend the read:

[0] https://blog.burntsushi.net/regex-internals/#nfa-optimizatio...

Yeah I'm saying why does your lexer actually need to be UTF-8 aware? (An actual question, because maybe I'm not thinking of some obvious case.)

Most of the lexical/syntactic elements of languages are not in UTF-8. You're looking for things like semicolons and quotes and whitespace. If you don't change the language syntax/lexical elements so that those parts stay as the ASCII subset of UTF-8 then why does your lexer need to be aware of UTF-8? It can just accumulate everything else as bytes and it doesn't matter what format the bytes are. The parser and/or codegen will do equality checks for lookups later on but that doesn't need to be UTF-8 aware either?

Am I missing something?

If you don't want to error/warn on invalid UTF-8 but instead handle it with the "garbage in, garbage out" principle, then yes you're right, treating them pure byte streams works.
Yeah that makes sense. It doesn't really strike as the job of the compiler/parser to validate UTF-8. If you've got a messed up text editor/OS environment that's going to be a problem for lots of things.
What dezgeg said is pretty much spot on, and also I think what you're describing related to "compiling the codepoints down to bytes" is in many ways equivalent to handling the UTF-8.

My opinion is, stated in a way that a TigerBeetler will resonate with ;), is I want to be able to handle radioactive levels of corruption in my inputs, and still parse them without blowing up, and issuing great error messages along the way.

Ok, one reason I can think of why you'd want to be UTF-8 aware is so that your error messages at any part of the parser could point to the exact column in the line of text. The line number you could get without being UTF-8 aware. But the column number you couldn't get without being UTF-8 aware.
> The line number you could get without being UTF-8 aware.

Can you? Unicode has the following "new line" characters:

* U+000A Line Feed (LF) alone

* U+000D Carriage Return (CR) alone

* CRLF as one indivisible sequence

* U+000B Line Tabulation (VT) — supporting this is explicitly optional, and the main standard's newline function definition does not include it

* U+000C Form Feed (FF)

* U+0085 Next Line (NEL), an EBCDIC round-trip compatibility character

* U+2028 Line Separator (LS)

* U+2029 Paragraph Separator (PS)

My source: https://langdev.stackexchange.com/a/590/717

Yes whitespace in unicode is expansive. However, you could (and I assume most languages do) specify that a newline is \n or \r\n which are expressible in ASCII.

Maybe I'm wrong though, just an assumption about what's common.

(See for example how Go, which is Unicode aware, defines tokens: https://go.dev/ref/spec#Tokens.)

There are also other concerns depending on your threat model: if you're parsing user-generated strings you definitely want to be able to handle corrupted unicode, for security reasons, and in these scenarios the way you handle recovery if you choose to do so may aggravate exploitation.

Consider a corrupted codepoint at the end of a user generated string: will it recognize the closing quote as such, or will it assume it is part of a corrupted codepoint and try to skip over it?

So many ways to shoot yourself in the foot by "abstracting away" the formal semantics of your inputs, I think it's pretty much never worth it. (An interesting search term here is LangSec)

> Consider a corrupted codepoint at the end of a user generated string: will it recognize the closing quote as such, or will it assume it is part of a corrupted codepoint and try to skip over it?

Maybe I'm misunderstanding you, but because of how UTF-8 is a superset of ASCII, I don't believe you can misrecognize ASCII characters if that's what you mean.

Do you treat non-ascii whitespace as whitespace or valid parts of a lexeme?
… or use `fgetwc()`.
C programs start out in the "C" locale, so just using fgetwc() won't work out of the box (or won't do what you expect it to do). You'll need to call setlocale("") to get the expected behavior.
> a context-sensitive grammar with unbounded lookaheads

You may want to look at chibicc, a toy (but self-hosting) C compiler written in C, which treats C grammar as if it was pretty much that.

Of course, the sane way is to not invent languages that can be naturally described only by a context-sensitive grammar with unbounded lookahead.

Agreed on all counts, except with the remark that even with sane grammars unbounded lookahead will still appear if you want to have IDE-grade error resilience.

But I wholeheartedly agree with the sentiment of "don't make the grammar look like Scala" <3

The “IDE-grade” error resilience can be approached in many ways, and unbounded lookahead is often unnecessary. The main problem you want to solve is the problem of recovering from an error, and parsing more of the file correctly after you encounter an error. One way you can do this is by finding statement or declaration boundaries, which can be done in surprisingly simple ways.

Depending on the language.

I mostly agree with this, but what I mean with the unbounded lookahead part of it is that bounding the amount of speculation (or lookaheads/backtracking) is equivalent to limiting the "size" of the error you can recover from.

You should definitely have bounds though, but the point is that if it's too low you might give up on the input too soon.

I can parse all of GCC's C torture tests with basic UTF8 support in my "toy" PEG-based parser, does that count?
Is your parser available anywhere?
> I hope to be on your level some day

Sometimes you just haven't met the right abstraction yet. I'll be that guy talking about parser combinators hopefully before the rest of this thread fills up with them. I don't think my parser does everything in your bare minimum (I haven't really thought about utf-8!) but it does do some other pretty advanced stuff. For example it leans on white-space pretty hard to figure things out. No curly braces or semicolons, and parentheses are only for precedence, not function application.

What my parser does do:

* backtracking/alternatives

* Some context-sensitivity, in-so-far as it can tell a negate from a minus.

Where it got a little hard:

* I realised I was parsing division the wrong way. a/b/c/d became a/(b/(c/d)), not the other way around.

Where it got medium hard: * Distinguishing unary minus from binary minus. I thought it would be really hard, but I only needed to look at the previous token to decide whether something was a TokNegate or a TokMinus.

Where it got hard:

* White-space/indentation sensitivity. I needed to first calculate the line-breaks and make that information (gathered during lexing) available during parsing.

Where it got really hard:

* LEARNING how to factor out the left-recursion. There were times when I literally thought it was impossible. I knew about 'precedence' in the back of my mind, but I didn't realise how the concept mapped to the code yet. By example: one sumExpression is (many or one productExpressions separated-by-'+'), and one productExpression is (many or one unaryExpression separated-by-'*'), and so on. You don't end up in an infinite-parse-loop if you try to parse the least-tightly-binding expressions first (which just seemed so counterintuitive that I guess I never tried?).

But I've yet to say why I like parser combinators so much (and think they're at least the 'simplest' way to do things, if not 'simple'):

You get to write code which looks like the bnf definition !

Just like TFA I'll take a lua example[1]

    var ::=  Name  |  prefixexp `[´ exp `]´  |  prefixexp `.´ Name
I would code this something like:

    var <- name <|> case2 <|> case 3
    where
    case2 = do
        pe <- prefixexp
        e  <- char '[' *> exp <* char ']'
        return (pe, e)

    case3 = do
        pe <- prefixexp
        _  <- char '.'
        n  <- name
        return (pe, n)
It more or less maps exactly onto the BNF, and in the above case, the extra complexity came from capturing the subexpressions and returning them to the caller. If I wrote a grammar to simply accept/deny its input (rather than trying to build an AST out of it), it could resemble the BNF even more:

Bnf definition vs. executable code:

    var ::= Name   |   prefixexp         `[´    exp          `]´  |   prefixexp         `.´    Name  
    var   = name  <|> (prefixexp >> char '[' >> exp >> char ']') <|> (prefixexp >> char '.' >> name)
I will say one other thing about the simplicity, which is - I didn't use an existing parser combinator library. They're simple enough to roll your own. There's only one trap which I can think of, which is where to draw the line on automatic-backtracking. I.e. Should the caller explicitly need to insert 'try's to enable backtracking.

> Cute, now do it with UTF-8 support.

Ironically I think this is the one feature where I'd prefer to be in C. C's approach with bytes is perfectly forward-compatible. A higher-level language might be more opinionated about its String type (restricting what you can or can't accept with your parser) or have funny definitions about length().

[1] http://parrot.github.io/parrot-docs0/0.4.7/html/languages/lu...*