Hacker News new | ask | show | jobs
by Drup 3382 days ago
While I agree with you those 3 points are extremely important, it turns out there is at least one parser generator that can do all of it: http://gallium.inria.fr/~fpottier/menhir/

It supports both incremental parsing and an API to inspect and recover incomplete ASTs (which powers Merlin, the IDE-like thing for OCaml). It provides stellar debugging features for ambiguous grammars and a way to have good error messages (which is used in compcert's C parser and facebook's reason).

So, it's not impossible. Most parser generators are not that good, though.

3 comments

Yeah, so we can spend a long time trying to find a parser generator that can solve all of our problems in the language that we use, which may or may not exist. Or you could just spend the same time to just write the parser. It turns out that writing a parser just isn't that hard, and in the time that you spend wavering on which parser generator to use, you could have most of your parser done.
> and in the time that you spend wavering on which parser generator to use, you could have most of your parser done.

Maybe, but if you want all of the features mentioned above (incremental parsing etc.), then writing a parser becomes much more complicated.

Or you could take the middle ground and use a library meant to make it easier to create hand built parsers. as opposed to a library that generates a parser for you.

* Repo: https://github.com/SAP/chevrotain

* Online Playground: http://sap.github.io/chevrotain/playground/

This handles the repetitive parts such as lookahead functions creation or Parse Tree building, while also providing advanced features such as error recovery or syntactic content assist (auto complete).

* ECMAScript family of languages supported (JavaScript/TypeScript/CoffeeScript).

Not really, it is rather easy to hack features into a recursive descent parser. That is is the main reason almost all production compilers are written that way.

With a parser generator, it quickly becomes a game of designing your language and its semantics around the limitations of the generator you are using.

That is super impressive! I can't find the part on incomplete AST or AST reuse in their reference docs though.
Details about the "incremental" mode are listed in the documentation PDF[0] at section 9.2

Here are the first couple of paragraphs:

"In this API, control is inverted. The parser does not have access to the lexer. Instead, when the parser needs the next token, it stops and returns its current state to the user. The user is then responsible for obtaining this token (typically by invoking the lexer) and resuming the parser from that state. The directory demos/calc-incremental contains a demo that illustrates the use of the incremental API.

This API is “incremental” in the sense that the user has access to a sequence of the intermediate states of the parser. Assuming that semantic values are immutable, a parser state is a persistent data structure: it can be stored and used multiple times, if desired. This enables applications such as “live parsing”, where a buffer is continuously parsed while it is being edited. The parser can be re-started in the middle of the buffer whenever the user edits a character. Because two successive parser states share most of their data in memory, a list of n successive parser states occupies only O(n) space in memory."

There does not appear to be a specific mention of having the partial AST available.

[0] - linked from their front page and available at http://gallium.inria.fr/~fpottier/menhir/manual.pdf

Hi,

I did a part of the incremental parsing in Menhir and the whole recovery aspect. I can try to explain a bit. My goal was Merlin (https://github.com/ocaml/merlin/), not research so there is no paper covering the work I am about to describe. Also as of today only incrementality and error message generation are part of upstream version of Menhir, but the rest should come soon.

# Incrementality, part I

The notion of incrementality that comes builtin with Menhir is slightly weaker than what you are looking for.

With Menhir, the parser state is reified and control of the parsing is given back to the user. The important point here is the departure from a Bison-like interface. The user of the parser is handled a (pure) abstract object that represents the state of the parsing. In regular parsing, this means we can store a snapshot of the parsing for each token, and resume from the first token that has changed (effectively sharing the prefix). But on the side, we can also run arbitrary analysis on a parser (for error message generation, recovery, syntactic completion, or more incrementality...).

# Incrementality, part II

Sharing prefix was good enough for our use case (parsing is not a bottleneck in the pipeline). But it turns out that a trivial extension to the parser can also solve your case.

Using the token stream and the LR automaton, you can structure the tokens as a tree:

- starts with a root node annotated by the state of the automaton

- store each token consumed as a leaf in that tree

- whenever you push on the automaton's stack, enter a sub-level of the tree (annotated by the new state), whenever you pop, return to the parent node

This gives you the knowledge that "when the parser is in state $x and $y is a prefix of the token stream, it is valid to directly reduce the subtree $z".

In a later parse, whenever you identify a known (state number, prefix) pair, you can short-circuit the parser and directly reuse the subtree of the previous parse.

If you were to write the parser by hand, this is simply memoization done on the parsing function (which is defunctionalized to a state number by the parser generator) and the prefix of token stream that is consumed by a call.

In your handwritten parser, reusing the objects from the previous parsetree amounts to memoizing a single run (and forgetting older parses). Here you are free to choose the strategy: you can memoize every run since the beginning, devise some caching policy, etc (think about a user (un)commenting blocks of code, or switching preprocessor definitions: you can get sharing of all known subtrees, if this is of any use :)).

So with part I and II, you get sharing of subtrees for free. Indeed, absolutely no work from the grammar writer has been required so far: this can all be derived by the parser generator (with the correctness guarantees that you can expect from it, as opposed to handwritten code). A last kind of sharing you might want is sharing the spine of the tree by mutating older objects. It is surely possible but tricky and I haven't investigated that at all.

# Error messages

The error message generation is part of the released Menhir version. It is described in the manual and papers by F. Pottier (like http://dl.acm.org/citation.cfm?doid=2892208.2892224, PDF available on http://gallium.inria.fr/~fpottier/biblio/pottier_abstracts.h...).

I might be biased but contrary to popular opinions I think that LR grammars are well suited to error message generation.

The prefix propery guarantees that the token pointed out by the parser is relevant to the error. The property means that there exist valid parses beginning with the prefix before this token. This is a property that doesn't hold for most backtracking parsers afaik, e.g PEG.

Knowledge of the automaton and grammar at compile time allow a precise work on error messages and separation of concerns: the tooling ensures exhaustivity of error coverage, assists in migration of error messages when the grammar is refactored, or can give an explicit description of the information available around a given error.

This is not completely free however, sometimes the grammar needs to be reengineered to carry the relevant information. But you would have to do that anyway with a handwritten parser and here the parser generator can help you....

If you have such a parser generator of course :). Menhir is the most advanced solution I know for that, and the UX is not very polished (still better than Bison). It is however a very officient workflow once you are used to it.

So LR is not the problem, but existing tools do a rather poor job at solving that kind of (real) problems.

# Recovery

In Merlin's, recovery is split in two parts.

The first is completion of the parsetree: for any prefix of a parse, it is possible to fill holes in the tree (and thus get a complete AST).

This is done by a mix of static analysis on the automaton and user guidance: for major constructions of the AST, the grammar writer provide "dummy" constructors (the node to use for erroneous expressions, incomplete statement, etc). The parser generator then checks that is has enough information to recover from any situation, or point out the cases it cannot handle.

It is not 100% free for the grammar writer, but for a grammar such as OCaml one it took less than an hour of initial work (I remember only a few minutes, but I wasn't measuring properly :)). It is also a very intuitive step as it follows the shape of the AST quite closely.

The second part is resuming the parse after an error (we don't fill the whole AST every time there is an error; before filling holes we try to look at later tokens to resume the parse). There is no big magic for that part yet, but the heuristic of indentation works quite well: constructions that have the same indentation in source code are likely to appear at the same depth in the AST, and this is used to decide when to resume consuming tokens rather than filling the AST. I have a few lead for more disciplined approaches, but the heuristic works so well that it isn't an urgency.

# Conclusion

I would say that if I had to write a parser for a language for which a reasonable LR grammar exists, I will use a (good) parser generator without hesitation. For the grammars of most programming languages, LR is the sweet spot between parsing expressivity and amenability to static analysis.

Nice! By the way, I think the Dragon Book needs an update on incremental parsing :)