Yes, this great article is all about how the maintainer of Bash regrets that it uses a parser generator (yacc):

http://www.aosabook.org/en/bash.html

I've mentioned this here before, but I was able to parse bash almost entirely up front, without interleaving parsing and execution. The first half of my blog [1] is about this.

To make a long story short, I use four interleaved parsers, and they ask the lexer to change state at the appropriate points. Three of them are separate recursive descent parsers, and the fourth is a Pratt parser for C-style arithmetic expressions.
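For anyone who hasn't seen a Pratt parser, here is a minimal Python sketch of the idea for a tiny subset of C-style arithmetic. It is not the code from my shell; the names (BINDING_POWER, ArithParser) and the operator set are made up for illustration, and it only handles integers, + - * / and parentheses.

```python
import re

# Illustrative binding powers for a small subset of C-style operators.
# Higher numbers bind tighter. These values are an assumption of this sketch.
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}


def tokenize(text):
    """Split an arithmetic string into integer, operator, and paren tokens."""
    return re.findall(r"\d+|[-+*/()]", text)


class ArithParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse(self, min_bp=0):
        """Parse an expression whose operators all bind tighter than min_bp."""
        tok = self.advance()
        if tok is None:
            raise SyntaxError("unexpected end of expression")
        if tok == "(":
            left = self.parse(0)
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
        elif tok.isdigit():
            left = int(tok)
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

        while True:
            op = self.peek()
            # Stop at end of input, at ')', or at an operator that does not
            # bind tighter than the caller's; '<=' makes operators
            # left-associative.
            if op not in BINDING_POWER or BINDING_POWER[op] <= min_bp:
                return left
            self.advance()
            right = self.parse(BINDING_POWER[op])
            left = (op, left, right)


def parse_arith(text):
    return ArithParser(tokenize(text)).parse()
```

Calling parse_arith("1 + 2 * 3") builds a nested tuple with the multiplication below the addition. Precedence lives in one table rather than in one grammar rule per level, which is why this style is convenient for expression languages with many operators.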

It works very nicely, and surprisingly the algorithm is efficient, requiring only two tokens of lookahead: http://www.oilshell.org/blog/2016/11/17.html

Aside from lookahead, the lexer reads the text exactly once, not 2, 3, or 4 times.
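To make the "parsers ask the lexer to change state" part concrete, here is a rough Python sketch. It is a toy, not my actual implementation: the two modes, the token names, and the regexes are all invented for this example. The point is that each parser passes the mode it wants when it asks for the next token, and the lexer position only ever moves forward, so the text is read once.

```python
import re

# Hypothetical lexer modes. Each mode has its own token rules, so the same
# characters can mean different things depending on which parser is asking.
PATTERNS = {
    "command": re.compile(
        r"(?P<ARITH_OPEN>\$\(\()|(?P<WORD>[^\s$]+)|(?P<SPACE>\s+)"
    ),
    "arith": re.compile(
        r"(?P<ARITH_CLOSE>\)\))|(?P<NUMBER>\d+)|(?P<OP>[-+*/])|(?P<SPACE>\s+)"
    ),
}


class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0  # only ever moves forward

    def read(self, mode):
        """Return the next (kind, value) token, tokenized under `mode`."""
        while self.pos < len(self.text):
            m = PATTERNS[mode].match(self.text, self.pos)
            if m is None:
                raise SyntaxError(
                    f"unexpected character at {self.pos} in mode {mode}"
                )
            self.pos = m.end()
            if m.lastgroup != "SPACE":
                return m.lastgroup, m.group()
        return "EOF", ""


def parse_arith(lexer):
    """Arithmetic sub-parser: collects raw tokens until the closing '))'."""
    tokens = []
    while True:
        kind, value = lexer.read("arith")
        if kind == "ARITH_CLOSE":
            return ("arith", tokens)
        if kind == "EOF":
            raise SyntaxError("unterminated $((")
        tokens.append(value)


def parse_command(lexer):
    """Command parser: hands off to the arithmetic parser at '$(('."""
    words = []
    while True:
        kind, value = lexer.read("command")
        if kind == "EOF":
            return words
        if kind == "ARITH_OPEN":
            words.append(parse_arith(lexer))
        else:
            words.append(value)
```

With this, parse_command(Lexer("echo $((1 + 2)) done")) returns the word list with the arithmetic part as its own sub-tree. A real shell needs many more modes (double quotes, here docs, and so on), but the hand-off between parsers has the same shape.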

There are two things you can't parse up front that I know of:

- Associative array syntax, but this is bash 4.0-specific: http://www.oilshell.org/blog/2016/10/20.html

- A crazy instance of runtime parsing of arithmetic expressions inside strings, AFTER variable substitution: https://github.com/oilshell/oil/issues/3 (all shells I tested implement this, not just bash)

Also there is one issue that would require arbitrary lookahead:

- Bash does arbitrary lookahead to distinguish $((1+2)) from $((echo hi)). The former is arithmetic and the latter is a subshell inside a command substitution. This lookahead is not required by POSIX: http://www.oilshell.org/blog/2016/11/18.html

In bash, brace substitution is really metaprogramming that can be done at parse time. You can manipulate program fragments, e.g. a{b,$((i++)),c,d}e, and it doesn't rely on any program input.
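Here is a small Python sketch of what "manipulating program fragments" means. It treats the word as plain text and never evaluates anything, so $((i++)) is just copied into one of the results. This is a deliberately simplified version that I'm assuming for illustration: it does not handle nested braces, ranges like {1..5}, or quoting, and expand_braces is a made-up name.

```python
import re

# Matches one brace group that contains a comma and no nested braces.
BRACE_GROUP = re.compile(r"\{([^{}]*,[^{}]*)\}")


def expand_braces(word):
    """Expand comma-separated brace groups in `word`, left to right.

    Works purely on the text of the word. Variable and arithmetic
    substitutions inside it are left untouched for a later stage.
    """
    m = BRACE_GROUP.search(word)
    if m is None:
        return [word]
    prefix, suffix = word[:m.start()], word[m.end():]
    results = []
    for alternative in m.group(1).split(","):
        # Recurse so that later brace groups in the suffix are expanded too.
        results.extend(expand_braces(prefix + alternative + suffix))
    return results
```

For example, expand_braces("a{b,$((i++)),c,d}e") produces four words, one of which still contains the unevaluated $((i++)). Nothing here needs the value of i, which is why the expansion can happen before the program runs.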

In ksh, brace substitution is done AFTER variable substitution, so it's another level of runtime parsing.

Globbing is done AFTER variable substitution in all shells.

But yes, lex and yacc are totally unsuitable for parsing shell. The language is unbelievably awkward to express with them, and you end up with more code, because the parser also has to be used for interactive input (the $PS2 problem), and it should be used for command completion too, e.g. completing something like 'echo $(ls /b<TAB>...' .

Using yacc also forces you into parsing at runtime, as far as I can tell. The yylex() interface involves a lot of globals, and the generated parsers probably don't compose as I would like.

[1] http://www.oilshell.org/blog/