Hacker News new | ask | show | jobs
by kevindong 3296 days ago
If you want to learn more about writing a shell from an undergraduate coursework perspective (and far closer to how bash does things), this is a chapter entirely on writing your own shell: https://www.cs.purdue.edu/homes/grr/SystemsProgrammingBook/B...

That PDF covers the basics of making your own shell (i.e. splitting input into tokens aka using a lexer, parsing the resulting tokens aka using yacc [a token parser], I/O redirection, piping, executing commands, wildcarding, interrupts, environmental variables, history, subshell, revising what you've typed without having to retype it, etc.).

Every undergraduate CS major at Purdue University is required to do the infamous shell lab (essentially, recreate csh). This project really taught me how shells work. Before doing this project, I could do the minimum in shells, but now I'm fairly competent at it.

3 comments

Since not a lot of folk knew this at my shop: wordexp() is the posix library for lexing strings "like the shell".
Note that wordexp() will also, unless explicitly told otherwise, perform command substitution and thus is capable of executing other processes. Be wary of using it on untrusted input.
I would not recommend using a lexer/parser generator for writing a shell, especially if you want to support things like backquote substitution and here-docs, since parsing and evaluating are interleaved. A recursive-descent parser is more flexible and suitable to the task.
Yes, this great article is all about how the maintainer of Bash regrets that it uses a parser generator (yacc):

http://www.aosabook.org/en/bash.html

I've mentioned this here before, but I was able to parse bash almost entirely up front, without interleaving parsing and execution. The first half of my blog [1] is about this.

To make a long story short, I use four interleaved parsers, and they ask the lexer to change state at the appropriate points. It's three separate recursive descent parsers, and then a Pratt parser for C-style arithmetic expressions.

It works very nicely, and surprisingly the algorithm is efficient, requiring only two tokens of lookahead: http://www.oilshell.org/blog/2016/11/17.html

Aside from lookahead, the lexer reads the text exactly once, not 2, 3, or 4 times.

There are two things you can't parse up front that I know of:

- Associative array syntax, but this is bash 4.0-specific: http://www.oilshell.org/blog/2016/10/20.html

- A crazy instance of runtime parsing of arithmetic expressions inside strings, AFTER variable substitution: https://github.com/oilshell/oil/issues/3 (all shells I tested implement this, not just bash)

Also there is one issue that would require arbitrary lookahead:

- Bash does arbitrary lookahead to distinguish $((1+2)) and $((echo hi)), the former being arithmetic, and the latter being a subshell inside a command sub, but it's not required by POSIX: http://www.oilshell.org/blog/2016/11/18.html

In bash, Brace substitution is really metaprogramming which can be done at parse time. You can manipulate program fragments, e.g. a{b,$((i++)),c,d}e, and it doesn't rely on any program input.

In ksh, brace substitution is done AFTER variable substitution, so it's another level of runtime parsing.

Globbing is done AFTER variable substitution in all shells.

But yes, lex and yacc are totally unsuitable for parsing shell. It's unbelievably awkward to express, and results in more code, because the parser has to be used for interactive input (the $PS2 problem), and it also should be used for command completion, e.g completing something like 'echo $(ls /b<TAB>...' .

It also forces you into parsing at runtime, as far as I can tell. The yylex() interface involves a lot of globals and the generated parsers probably don't compose as I would like.

[1] http://www.oilshell.org/blog/

It's much simpler than you would want to paint it.

  [dozzie@alojzy dozzie]$ `echo for` i in 1 2 3 4; do echo $i; done
  zsh: parse error near `do'
  [dozzie@alojzy dozzie]$ bash -c '`echo for` i in 1 2 3 4; do echo $i; done'
  bash: -c: line 0: syntax error near unexpected token `do'
  bash: -c: line 0: ``echo for` i in 1 2 3 4; do echo $i; done'
I think parser generator will be sufficient, no need to resort to I-can't-write-grammars recursive descent.

Unless you were talking about csh syntax; that one I never learned, so I don't know if it is context-free.

> Every undergraduate CS major at Purdue University is required to do the infamous shell lab

Huh. Is the course based on CMU's book "Computer Systems: A Programmer's Perspective"? The CS:APP book is used by like a hundred schools nationwide and there's a shell lab there too: http://csapp.cs.cmu.edu/3e/shlab.pdf

My school (the University of Utah) used this book for one of our courses. Worse than the shell lab was implementing malloc by hand, haha. Rough weekend right there.

The undergrad OS class at Purdue has a Malloc from scratch lab and a shell lab. Really great class, I don't know what if anything it is based on they had us buy the "Dinasaur" book but I don't remember ever using it that was a while ago now though.
My school did it as well, the shell lab was our final project