Hi,

I did part of the incremental parsing in Menhir and all of the recovery work. I can try to explain a bit. My goal was Merlin (https://github.com/ocaml/merlin/), not research, so there is no paper covering the work I am about to describe. Also, as of today, only incrementality and error message generation are part of the upstream version of Menhir, but the rest should come soon.

# Incrementality, part I

The notion of incrementality that comes built into Menhir is slightly weaker than what you are looking for.

With Menhir, the parser state is reified and control of the parsing is given back to the user. The important point here is the departure from a Bison-like interface. The user of the parser is handed a (pure) abstract object that represents the state of the parse. In regular parsing, this means we can store a snapshot of the parser state for each token, and resume from the first token that has changed (effectively sharing the prefix). But on the side, we can also run arbitrary analyses on a parser (for error message generation, recovery, syntactic completion, or more incrementality...).
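To make the prefix-sharing idea concrete, here is a minimal sketch in Python. It does not use Menhir's API: the "parser" is a toy bracket matcher whose state is an immutable tuple, and `step`, `parse` and `reparse` are hypothetical names chosen for illustration. The only point is that a pure state can be snapshotted per token and resumed later.

```python
from typing import List, Optional, Tuple

# Immutable stack of open brackets, standing in for a pure parser state.
State = Tuple[str, ...]
CLOSERS = {")": "(", "]": "["}


def step(state: State, token: str) -> State:
    """Consume one token and return a new state; the old one is never mutated."""
    if token in ("(", "["):
        return state + (token,)
    if token in CLOSERS:
        if not state or state[-1] != CLOSERS[token]:
            raise SyntaxError(f"unexpected {token!r}")
        return state[:-1]
    return state  # every other token is neutral in this toy grammar


def parse(tokens: List[str],
          snapshots: Optional[List[State]] = None,
          start: int = 0) -> List[State]:
    """Return a list where entry i is the state just before tokens[i]."""
    result = list(snapshots[: start + 1]) if snapshots else [()]
    for token in tokens[start:]:
        result.append(step(result[-1], token))
    return result


def reparse(old_tokens: List[str], old_snapshots: List[State],
            new_tokens: List[str]) -> Tuple[List[State], int]:
    """Resume from the first token that differs, reusing earlier snapshots."""
    k = 0
    limit = min(len(old_tokens), len(new_tokens))
    while k < limit and old_tokens[k] == new_tokens[k]:
        k += 1
    return parse(new_tokens, old_snapshots, start=k), k


old = list("(a[b]c)")
snaps = parse(old)
new_snaps, shared = reparse(old, snaps, list("(a[b]d)"))
```

Here `shared` is the number of leading tokens that were not consumed again; only the suffix starting at the first changed token is re-run.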

# Incrementality, part II

Sharing the prefix was good enough for our use case (parsing is not a bottleneck in the pipeline). But it turns out that a trivial extension to the parser can also solve your case.

Using the token stream and the LR automaton, you can structure the tokens as a tree:

- start with a root node annotated by the state of the automaton

- store each token consumed as a leaf in that tree

- whenever you push on the automaton's stack, enter a sub-level of the tree (annotated by the new state); whenever you pop, return to the parent node

This gives you the knowledge that "when the parser is in state $x and $y is a prefix of the token stream, it is valid to directly reduce the subtree $z".

In a later parse, whenever you identify a known (state number, prefix) pair, you can short-circuit the parser and directly reuse the subtree of the previous parse.
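The tree construction described above can be sketched roughly as follows. The event trace format (`"token"`, `"push"`, `"pop"`) is an assumption made for this example, as are the names `Node`, `build_tree` and `tokens_of`; how pushes and pops map onto a concrete automaton is left open here.

```python
class Node:
    """One level of the token tree, annotated with an automaton state."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []  # tokens (str) and sub-nodes, in source order


def build_tree(events, root_state=0):
    """Build the tree from a hypothetical trace of parser events."""
    root = Node(root_state)
    current = root
    for event in events:
        kind = event[0]
        if kind == "token":      # a consumed token becomes a leaf
            current.children.append(event[1])
        elif kind == "push":     # enter a sub-level annotated by the new state
            child = Node(event[1], parent=current)
            current.children.append(child)
            current = child
        elif kind == "pop":      # return to the parent node
            if current.parent is None:
                raise ValueError("pop at the root")
            current = current.parent
        else:
            raise ValueError(f"unknown event {kind!r}")
    return root


def tokens_of(node):
    """Flatten a subtree back into the tokens it covers."""
    out = []
    for child in node.children:
        if isinstance(child, Node):
            out.extend(tokens_of(child))
        else:
            out.append(child)
    return out


events = [("token", "("), ("push", 3), ("token", "b"),
          ("token", "c"), ("pop",), ("token", ")")]
root = build_tree(events)
sub = root.children[1]
reuse_key = (sub.state, tuple(tokens_of(sub)))
```

In this sketch, `reuse_key` plays the role of the (state number, prefix) pair: a later parse that meets the same pair could reuse `sub` directly.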

If you were to write the parser by hand, this is simply memoization of the parsing function (which the parser generator defunctionalizes to a state number), keyed on the prefix of the token stream that a call consumes.

In your handwritten parser, reusing the objects from the previous parsetree amounts to memoizing a single run (and forgetting older parses). Here you are free to choose the strategy: you can memoize every run since the beginning, devise some caching policy, etc. (think about a user (un)commenting blocks of code, or switching preprocessor definitions: you can get sharing of all known subtrees, if this is of any use :)).
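The memoization view can be illustrated with a small handwritten parser. This is a sketch under assumptions, not what Menhir generates: the grammar is a toy one (`expr := ATOM | "(" expr* ")"`), `MemoParser` is a hypothetical name, and the cache is only sound because the result of a call depends on nothing beyond the tokens it consumes. The cache is never emptied, which corresponds to the "memoize every run" strategy.

```python
class MemoParser:
    def __init__(self):
        # (parsing function, first token) -> [(consumed tokens, subtree)]
        self.cache = {}
        self.hits = 0

    def parse_expr(self, tokens, i):
        """Parse one expression at tokens[i]; return (subtree, next index)."""
        if i >= len(tokens):
            raise SyntaxError("unexpected end of input")
        if tokens[i] == ")":
            raise SyntaxError("unexpected ')'")
        if tokens[i] != "(":
            return tokens[i], i + 1  # atoms are single tokens, not cached

        key = ("expr", "(")
        # Known (function, prefix) pair: short-circuit and reuse the subtree.
        for prefix, tree in self.cache.get(key, []):
            if tuple(tokens[i:i + len(prefix)]) == prefix:
                self.hits += 1
                return tree, i + len(prefix)

        start = i
        i += 1
        children = []
        while i < len(tokens) and tokens[i] != ")":
            child, i = self.parse_expr(tokens, i)
            children.append(child)
        if i >= len(tokens):
            raise SyntaxError("missing ')'")
        i += 1
        tree = tuple(children)
        self.cache.setdefault(key, []).append((tuple(tokens[start:i]), tree))
        return tree, i


parser = MemoParser()
tree1, _ = parser.parse_expr("( a ( b c ) d )".split(), 0)
tree2, _ = parser.parse_expr("( x ( b c ) d )".split(), 0)
```

After the second call, the inner group of `tree2` is the very object built during the first parse. A real implementation would want a better lookup structure than a linear scan (a trie on tokens, for instance) and some eviction policy.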

So with part I and II, you get sharing of subtrees for free. Indeed, absolutely no work from the grammar writer has been required so far: this can all be derived by the parser generator (with the correctness guarantees that you can expect from it, as opposed to handwritten code). A last kind of sharing you might want is sharing the spine of the tree by mutating older objects. It is surely possible but tricky and I haven't investigated that at all.

# Error messages

Error message generation is part of the released Menhir version. It is described in the manual and in papers by F. Pottier (like http://dl.acm.org/citation.cfm?doid=2892208.2892224, PDF available on http://gallium.inria.fr/~fpottier/biblio/pottier_abstracts.h...).

I might be biased, but contrary to popular opinion I think that LR grammars are well suited to error message generation.

The prefix property guarantees that the token pointed out by the parser is relevant to the error. The property means that there exist valid parses beginning with the prefix before this token. This is a property that doesn't hold for most backtracking parsers afaik, e.g. PEG.

Knowledge of the automaton and grammar at compile time allows precise work on error messages and a separation of concerns: the tooling ensures exhaustive error coverage, assists in migrating error messages when the grammar is refactored, and can give an explicit description of the information available around a given error.

This is not completely free, however: sometimes the grammar needs to be reengineered to carry the relevant information. But you would have to do that anyway with a handwritten parser, and here the parser generator can help you...

If you have such a parser generator of course :). Menhir is the most advanced solution I know for that, and the UX is not very polished (still better than Bison). It is however a very efficient workflow once you are used to it.

So LR is not the problem, but existing tools do a rather poor job of solving this kind of (real) problem.

# Recovery

In Merlin, recovery is split into two parts.

The first is completion of the parsetree: for any prefix of a parse, it is possible to fill holes in the tree (and thus get a complete AST).

This is done by a mix of static analysis on the automaton and user guidance: for major constructions of the AST, the grammar writer provides "dummy" constructors (the node to use for erroneous expressions, incomplete statements, etc.). The parser generator then checks that it has enough information to recover from any situation, or points out the cases it cannot handle.
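As a handwritten analogue of this completion step (Merlin derives it from the LR automaton, which this sketch does not attempt), here is a toy parser for `expr := term ('+' term)*` and `term := NUM | '(' expr ')'` that accepts any valid prefix of the input and fills the gaps. `HOLE` stands in for a dummy constructor supplied by the grammar writer; it and `complete` are hypothetical names.

```python
HOLE = ("hole",)  # dummy node used wherever an expression is missing


def complete(tokens):
    """Parse a possibly truncated token list and return a complete AST."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def term():
        nonlocal pos
        tok = peek()
        if tok is None:
            return HOLE  # input stopped where a term was expected
        if tok == "(":
            pos += 1
            inner = expr()
            if peek() == ")":
                pos += 1
            elif peek() is not None:
                raise SyntaxError(f"unexpected {peek()!r}")
            # At end of input the missing ')' is simply assumed.
            return ("paren", inner)
        if tok.isdigit():
            pos += 1
            return ("num", int(tok))
        raise SyntaxError(f"unexpected {tok!r}")

    def expr():
        nonlocal pos
        left = term()
        while peek() == "+":
            pos += 1
            left = ("add", left, term())
        return left

    ast = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]!r}")
    return ast


partial = complete("1 + ( 2 +".split())
```

The result is a full tree in which the missing right operand is a `HOLE` and the unclosed parenthesis has been closed implicitly. Input that is not a prefix of any valid parse still raises `SyntaxError`; resuming after such errors is the second part of recovery.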

It is not 100% free for the grammar writer, but for a grammar such as OCaml's it took less than an hour of initial work (I remember only a few minutes, but I wasn't measuring properly :)). It is also a very intuitive step as it follows the shape of the AST quite closely.

The second part is resuming the parse after an error (we don't fill the whole AST every time there is an error; before filling holes we try to look at later tokens to resume the parse). There is no big magic for that part yet, but the indentation heuristic works quite well: constructions that have the same indentation in source code are likely to appear at the same depth in the AST, and this is used to decide when to resume consuming tokens rather than filling the AST. I have a few leads for more disciplined approaches, but the heuristic works so well that it isn't urgent.
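One possible reading of the indentation heuristic, written as a small Python function. This is a guess at the shape of the idea, not Merlin's implementation: `resume_point` and its inputs are hypothetical, and it assumes that each open construct remembers the indentation of the line it started on and that only tokens starting a line are considered as resume candidates.

```python
def resume_point(open_indents, following):
    """Decide where to resume after a syntax error.

    open_indents: indentation of each construct still open at the error,
                  outermost first.
    following:    (indent, token) pairs for the line-initial tokens that
                  come after the error.

    Returns (tokens_to_skip, constructs_to_close).
    """
    for skipped, (indent, _token) in enumerate(following):
        # Look for the innermost open construct with the same indentation:
        # the new token is then likely a sibling of that construct.
        for depth in range(len(open_indents) - 1, -1, -1):
            if open_indents[depth] == indent:
                return skipped, len(open_indents) - depth
    # No match: skip everything and complete the whole AST.
    return len(following), len(open_indents)


decision = resume_point([0, 2, 4], [(6, "foo"), (2, "let"), (0, "let")])
```

In this example one token is skipped and the two innermost constructs are closed (their holes filled) before parsing resumes at the `let` indented by two columns.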

# Conclusion

I would say that if I had to write a parser for a language for which a reasonable LR grammar exists, I would use a (good) parser generator without hesitation. For the grammars of most programming languages, LR is the sweet spot between parsing expressivity and amenability to static analysis.