Hacker News new | ask | show | jobs
by userbinator 3287 days ago
I would not recommend using a lexer/parser generator for writing a shell, especially if you want to support things like backquote substitution and here-docs, since parsing and evaluating are interleaved. A recursive-descent parser is more flexible and suitable to the task.
2 comments

Yes, this great article is all about how the maintainer of Bash regrets that it uses a parser generator (yacc):

http://www.aosabook.org/en/bash.html

I've mentioned this here before, but I was able to parse bash almost entirely up front, without interleaving parsing and execution. The first half of my blog [1] is about this.

To make a long story short, I use four interleaved parsers, and they ask the lexer to change state at the appropriate points. It's three separate recursive descent parsers, and then a Pratt parser for C-style arithmetic expressions.

It works very nicely, and surprisingly the algorithm is efficient, requiring only two tokens of lookahead: http://www.oilshell.org/blog/2016/11/17.html

Aside from lookahead, the lexer reads the text exactly once, not 2, 3, or 4 times.

There are two things you can't parse up front that I know of:

- Associative array syntax, but this is bash 4.0-specific: http://www.oilshell.org/blog/2016/10/20.html

- A crazy instance of runtime parsing of arithmetic expressions inside strings, AFTER variable substitution: https://github.com/oilshell/oil/issues/3 (all shells I tested implement this, not just bash)

Also there is one issue that would require arbitrary lookahead:

- Bash does arbitrary lookahead to distinguish $((1+2)) and $((echo hi)), the former being arithmetic, and the latter being a subshell inside a command sub, but it's not required by POSIX: http://www.oilshell.org/blog/2016/11/18.html

In bash, Brace substitution is really metaprogramming which can be done at parse time. You can manipulate program fragments, e.g. a{b,$((i++)),c,d}e, and it doesn't rely on any program input.

In ksh, brace substitution is done AFTER variable substitution, so it's another level of runtime parsing.

Globbing is done AFTER variable substitution in all shells.

But yes, lex and yacc are totally unsuitable for parsing shell. It's unbelievably awkward to express, and results in more code, because the parser has to be used for interactive input (the $PS2 problem), and it also should be used for command completion, e.g completing something like 'echo $(ls /b<TAB>...' .

It also forces you into parsing at runtime, as far as I can tell. The yylex() interface involves a lot of globals and the generated parsers probably don't compose as I would like.

[1] http://www.oilshell.org/blog/

It's much simpler than you would want to paint it.

  [dozzie@alojzy dozzie]$ `echo for` i in 1 2 3 4; do echo $i; done
  zsh: parse error near `do'
  [dozzie@alojzy dozzie]$ bash -c '`echo for` i in 1 2 3 4; do echo $i; done'
  bash: -c: line 0: syntax error near unexpected token `do'
  bash: -c: line 0: ``echo for` i in 1 2 3 4; do echo $i; done'
I think parser generator will be sufficient, no need to resort to I-can't-write-grammars recursive descent.

Unless you were talking about csh syntax; that one I never learned, so I don't know if it is context-free.