Hacker News new | ask | show | jobs
by billconan 2238 days ago
Thank you very much for the references. They are very good reads!

In the article about Python, the author said:

> Most languages of interest are context sensitive, and yet we are trying to use inappropriate tools (context-free grammars and context-free parser generators) to specify and parse them. This leads to bad specifications and bad implementations.

Then what do you think is a better tool to specify and parse languages?

1 comments

what do you think is a better tool to specify and parse languages?

That's unfortunately an open problem. As I mentioned here, lexing is "solved" but parsing isn't:

https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-...

Here are what some recent languages do:

Go: Maintain a written spec [1], with multiple hand-written parsers. (I think the main one is in Go, while gccgo has one in C++?) I believe Go used to be parsed with yacc, but they moved to a hand-written parser for error messages. It looks like [2] was an early attempt to keep it generated.

[1] https://golang.org/ref/spec

[2] https://research.swtch.com/yyerror

-----

Rust: The parser is hand-written, like Go. There's a grammar in the spec, but like Go's, it's not used to generate code, so it isn't tested. They also have an LALR(1) grammar which was suppoed to be an executable specification, but as far as I can tell that effort has not made progress recently.

It used to be done with GNU yacc but it looks like they moved to a Rust-based tool [3]

[1] https://doc.rust-lang.org/grammar.html

[2] https://github.com/rust-lang/lang-team/tree/master/working-g...

[3] https://github.com/rust-lang/wg-grammar/tree/master/grammar

-----

So basically the state of the art is fairly laborious. It's basically write down the grammar, implement it by hand, and try not to make any mistakes. If you need to implement it again, which is extremely common (e.g. for gccgo, for Rust's language server), also try not to make any mistakes.

When you need to change the language, update the spec, and change all the parsers by hand.

I've written multiple recursive-descent parsers based on grammars, and it's not hard if you have examples to test with, but mistakes can easily creep in.

Rely on users to report bugs. There is a "long tail" of programs that will tickle various corner cases.

-----

Here's the approach I took with Oil:

How to Parse Shell Like a Programming Language http://www.oilshell.org/blog/2019/02/07.html

which is basically to do as much work as possible in the lexer, use a high-level programming language translated to C++ for the parser, and to use a DSL for specifying the AST. This approach keeps the mistakes and tedium down because shell is an extremely large language syntactically. (It could be the biggest of any language except for perhaps Perl. Shell is relatively small semantically, but huge syntactically (which is why I'm preoccupied with parsing :) ).