Hacker News new | ask | show | jobs
by mrkeen 500 days ago
Imo this missed the punchline.

I banged my head on the table for days trying to fix a left-recursive grammar, just by moving bits around and hoping.

The key is precedence! Parse loose expressions first (plus, minus, etc), these loose expressions can in turn call into tighter expressions (mul, div), and so on down to parentheses.

Parentheses do form a loop - that is, they call all the way back to the top - but at long as the parser makes progress i.e. can only proceed if it moves past a '(', no infinite loops and you're golden.

4 comments

Debugging a complex PEG is a nightmarish task. I use various tools, but I couldn't find anything out there that will let you set a breakpoint in a file that's being parsed and let you explore the parsing state.

The most useful tools I found were adjacent to the cpp-peglib library: https://github.com/yhirose/cpp-peglib

This comes with a PEG playground: https://yhirose.github.io/cpp-peglib/

I really liked pegdebug: https://mqnc.github.io/pegdebug/

With sample output here: https://mqnc.github.io/pegdebug/example/output.html

pegdebug is nice for small sets of data, but it rapidly gets swamped by anything over about 50 lines.

If anyone has other suggestions for debugging PEGs, please reply and let me know,.

I think you're semi-right here, but the term precedence is ambiguous.

Left recursion causes rules to be ambiguous. In `<S> | a`, it's always ambiguous if the input has ended. Thus the data structure that would be traversed can be a circular graph. In `<S> := a | <S>, S := ε', the rules are unambiguous and the data structure that is traversed has the shape of a tree.

You get precedence rules by defining the shape of this tree, thus which rules will be visited before which.

However it doesn't have to be strict precedence. Some rules can have an ambiguous order as long as the execution still follows the shape of a tree.

For example, in my simple rule above, it doesn't matter if you visit the ε rule or the other rule first in your implementation (although it's less efficient to visit ε first).

So, I think precedence is a side-effect of doing this transformation.

I like a lot of things about PEG but the way they support precedence is for the birds. I'd really love to see a PEG toolkit built as if developer experience matters but the problem with parser generators is that anybody who can write a parsed generator understands how to work around the problems with parser generators -- so we don't get parser generators that fit the needs of developers who don't know a lot of CS theory.
Precedence is more of a way to remove ambiguity from grammar. Grammars are just hard to figure out in one go.
I appreciate the dismissal, but I stand by what I said.

I used to think 'precedence was just for removing ambiguity', so I didn't pay attention to it.

Then I discovered it was the key to structuring grammars to avoid left recursion, and everything changed for me.

I've used it successfully in the past and I'll do so in the future.

I think you're not using the word precedence properly. It seems as if you're trying to explicitly instruct a top-down parser.

E.g., the precedence in these two grammars (for simple arithmetic expressions) is identical

    S -> F | S + F
    F -> T | F * T
    T -> 0 | 1 | 2 | ...

    S -> F | F + S
    F -> T | T * F
    T -> 0 | 1 | 2 | ...
but one recurses on the left side, and the other on the right.
> It seems as if you're trying to explicitly instruct a top-down parser.

My mistake, that didn't even occur to me! I only ever use parser combinators so I just read grammars as if they were the code, and vice-versa.

e.g. you could implement the grammar on the upper line as the code on the lower line:

  S -> F  |   F          +     S
  s  = f <|> (f >> char '+' >> s)
The two grammars you posted both 'look good' just by eyeballing them, neither needs 'fixing' (per the article title). They have identical precedence and identical don't-recurse-infinitely properties.
A grammar is not the language. A language can be described by infinitely many grammars. That also goes for precedence. Parsing theory describes which algorithms can handle which grammars, and how efficiently they can do that. It allows you to ignore the details of parsing.
I see, I'd like to give it a try then. I'll try adding some precedence here. I don't know how, though.

This calls for some reading...

I put it off as well, I figured precedence and associativity is something I'd learn later after I figured out how to stop my **ing grammars from looping forever. But it was right there in front of me!

When I looked at someone else's working code, they'd laid it out so that each layer could only call into the same layer or the next layer down:

  parseSum     = ...
  parseProduct = ...
  parseTerm    = ...
which should eliminate the mutual-infinite-recursion case, since the control flow can't ever go upward. And it didn't look like a co-incidence that they'd laid the code out from low-precedence mathematical operators to high-precedence mathematical operators.

The code I ended up using is something like

  parseSum = parseProduct then many('+' then parseProduct) // where 'many' means 0 or more.
But if you don't have 'many' in your grammar syntax, it looks like this would work instead:

  parseSum = parseProduct + parseSum
But the above is wrong for two reasons. Firstly, you need to have a '+', so there's no way to parse an expression that doesn't have a '+' in it. But it also right-associates and you end up with (1+(2+(3+4))) which is probably fine for addition (give or take some C-int-overflow-rules), but breaks division. I think you can fix both by putting the terminal case first, and then left-recursing after that:

  parseSum = parseProduct | parseSum + parseProduct
Then there's only one remaining issue to deal with - what do you do when you actually want to loop? Sums can contain products, and products can contain sums (as long as they're in parens.) Well, parens bind tighter than any other precedence, so they go straight to the bottom:

  parseSum     = ...
  parseProduct = ...
  parseTerm    = ...
  parseParens  = '(' parseSum ')'
So we've broken the no-upward-control-flow rule, but is it an infinite loop? The only case where control flow can go upward is if you consume a '(' first, so I say no.