Hacker News new | ask | show | jobs
by ufo 326 days ago
A middle ground that I think is sometimes useful is to use an LR parser generator to check if the grammar is ambiguous, but use recursive descent for the actual implementation. Since we won't actually use any code from the LR parser generator, you can pick whatever one you prefer regardless of the programming language.
2 comments

It's trivial to get a recursive descent parser without any ambiguities hidden in it if you don't go the PEG route (which is only unambiguous because you always pick the first choice, which might not be what you want). Just always branch on the current token. No way to have an ambiguity like that.
I disagree. When writing recursive descent by hand, it's easy to miss an ambiguity because of miscomputed FIRST and FOLLOW sets.

In practice most recursive descent parsers use if-else liberally. Thus, they effectively work like pegs where the first match wins (but without the limited backtracking of pegs). They are deterministic in the sense that the implementation always returns a predictable result. But they are still ambiguous in the sense that this behavior might not have been planned by the language designer, and the ambiguity may not have been resolved how the programmer expected.

Without a comprehensive test suite you can easily break a recursive descent parser. By adding code into some function to handle something new, you can accidentally prevent some existing syntax from being recognized.

It has been my eperience that if you have a LALR parser that reports no errors at generation time, and you add something such that there are still no errors, you've not ruined any existing syntax. That could be a theorem.

Don't compute first and follow sets. Just branch on the current token. It is trivially unambiguous since 1 token = 1 branch. Expressions can be dealt with using precedence climbing / pratt, which still just amounts to branching on the current token after the "lhs" has been computed.

If the language doesn't fit this LL1 + operator precedence mold then I would not use a recursive descent parser.

The whole issue is that, without computing first and follow sets, it's easy to make a mistake and think that the grammar is LL(1) when it actually isn't. When that happens, the if statement does you no good. It'll be deterministic, but will silently mask the ambiguity.
The only case I can think of is something like the dangling else problem or similar, which is very easy to detect.

You need to have a recursive call to the same rule or a "parent" rule, followed by an optional token match (note that this is already more than a simple branch on the current token since it is effectively a 1 token backtrack).

If there's any extra token match inbetween (like mandatory {}) you've already dodged the problem.

So I agree a mistake is possible, but "easy" I would not call it. It only appears in very specific conditions. The issue is way more prevalent in general PEGs.

I've been having thoughts along these lines. Earley parsers match recursive descent really nicely. In my head there'd by an Earley parser "oracle": you'd tell the oracle about the operations you've performed (in terms of terminal consumption); and, then, you can ask the oracle which recursive descent subfunctions are safe to call (based on the prediction phase).