| HN Mirror

what do you think is a better tool to specify and parse languages?

That's unfortunately an open problem. As I mentioned here, lexing is "solved" but parsing isn't:

https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-...

Here are what some recent languages do:

Go: Maintain a written spec [1], with multiple hand-written parsers. (I think the main one is in Go, while gccgo has one in C++?) I believe Go used to be parsed with yacc, but they moved to a hand-written parser for error messages. It looks like [2] was an early attempt to keep it generated.

[1] https://golang.org/ref/spec

[2] https://research.swtch.com/yyerror

-----

Rust: The parser is hand-written, like Go. There's a grammar in the spec, but like Go's, it's not used to generate code, so it isn't tested. They also have an LALR(1) grammar which was suppoed to be an executable specification, but as far as I can tell that effort has not made progress recently.

It used to be done with GNU yacc but it looks like they moved to a Rust-based tool [3]

[1] https://doc.rust-lang.org/grammar.html

[2] https://github.com/rust-lang/lang-team/tree/master/working-g...

[3] https://github.com/rust-lang/wg-grammar/tree/master/grammar

-----

So basically the state of the art is fairly laborious. It's basically write down the grammar, implement it by hand, and try not to make any mistakes. If you need to implement it again, which is extremely common (e.g. for gccgo, for Rust's language server), also try not to make any mistakes.

When you need to change the language, update the spec, and change all the parsers by hand.

I've written multiple recursive-descent parsers based on grammars, and it's not hard if you have examples to test with, but mistakes can easily creep in.

Rely on users to report bugs. There is a "long tail" of programs that will tickle various corner cases.

-----

Here's the approach I took with Oil:

How to Parse Shell Like a Programming Language http://www.oilshell.org/blog/2019/02/07.html

which is basically to do as much work as possible in the lexer, use a high-level programming language translated to C++ for the parser, and to use a DSL for specifying the AST. This approach keeps the mistakes and tedium down because shell is an extremely large language syntactically. (It could be the biggest of any language except for perhaps Perl. Shell is relatively small semantically, but huge syntactically (which is why I'm preoccupied with parsing :) ).