by pacaro 3292 days ago
Thanks for continuing your series.

This is a great illustration of how non-trivial a production quality shell is.

Parsing input is "tricky"

Each builtin needs comprehensive error handling


IRL, shells don't do string manipulation (well, technically everything becomes string manipulation at some point, but in this context not in the normal sense of the term). Shells generally use a lexer to split inputs up into tokens (generally using regex) [0] and then make sense of the inputs using a parser (the most famous of which is called yacc [1]).

[0]: I was going to link to Bash's lex file here, but they appear to do something funky which would require a non-trivial amount of time to find, understand, and write here. So, you'll just have to take my word on this. I give you wikipedia as a substitute: https://en.wikipedia.org/wiki/Lexical_analysis

[1]: https://git.savannah.gnu.org/cgit/bash.git/tree/parse.y

The lexer for bash is inside that file, parse.y -- see yylex(), which calls read_token(). It doesn't use lex; it's written by hand.

I'm not sure what you mean that shells don't do string manipulation. Almost ALL they do is string manipulation.

That's true for the shell interpreter, which has to make sense of the input program, and for user programs, which are processing argv strings like file system paths, and stdin.

There are actually a handful of different parsers inside bash, which I mention here: http://www.oilshell.org/blog/2016/10/26.html

Brace substitution is another little parser as well. And globbing, and regex, both of which need their own parsers. (bash has its own glob parser, but some shells use libc's glob implementation). bash is really 4-7 sublanguages in one.

The annoying thing about shell is that it makes it impossible NOT to do string manipulation in your program, because there is all this implicit stuff like word splitting.

One of my takeaways from TFA was along the lines of…

"Hmmm… he's using strtok, that's not how a real shell would work. What would a minimal shell, without scripting, pipes, redirects etc. do? Just correctly parsing legal file paths (which TFA needs to correctly implement 'cd') is well out of scope of a small article like this."

Right, a real shell obviously can't use strtok. If you're leaving out pipes, redirects, and any control flow, then separating a shell string into words for the argv[] array is fairly similar to lexing a C-escaped string (e.g. in C, Java, Python, JavaScript).
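To make the strtok problem concrete, here is a small sketch in Python rather than C. `str.split` stands in for strtok-style delimiter splitting, and the standard library's `shlex.split` stands in for a quote-aware word splitter. The input line is a made-up example, not taken from TFA.

```python
import shlex

# Hypothetical input: a directory name containing a space.
line = 'cd "My Documents"'

# Delimiter splitting, roughly what strtok(line, " ") gives you:
# the quotes are kept and the path is cut in two.
naive = line.split(" ")

# Quote-aware splitting: the quoted path stays one argument.
quoted = shlex.split(line)

print(naive)   # ['cd', '"My', 'Documents"']
print(quoted)  # ['cd', 'My Documents']
```

With the naive split, 'cd' receives two arguments that both still contain a quote character, so it cannot change into the directory.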

You have backslashes, single quotes, and double quotes basically. Traditionally this is done with a switch statement in a loop in C.
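A rough sketch of that loop, written in Python instead of C, with an if/elif chain playing the role of the switch. The function name `split_words` and the state names are my own, and this only covers the three quoting forms mentioned above. It is not how bash does it.

```python
def split_words(line):
    """Split a command line into argv-style words.

    Handles only backslash escapes, single quotes, and double quotes.
    No expansions, pipes, redirects, or control flow.
    """
    words = []
    current = []      # characters of the word being built
    in_word = False   # True once a word has started, so '' yields an empty word
    state = "plain"   # one of "plain", "single", "double"
    i = 0
    while i < len(line):
        ch = line[i]
        if state == "plain":
            if ch in " \t\n":
                # Unquoted whitespace ends the current word, if any.
                if in_word:
                    words.append("".join(current))
                    current, in_word = [], False
            elif ch == "\\":
                if i + 1 >= len(line):
                    raise ValueError("trailing backslash")
                i += 1
                current.append(line[i])  # next character taken literally
                in_word = True
            elif ch == "'":
                state, in_word = "single", True
            elif ch == '"':
                state, in_word = "double", True
            else:
                current.append(ch)
                in_word = True
        elif state == "single":
            # Nothing is special inside single quotes except the closing quote.
            if ch == "'":
                state = "plain"
            else:
                current.append(ch)
        else:  # state == "double"
            if ch == '"':
                state = "plain"
            elif ch == "\\" and i + 1 < len(line) and line[i + 1] in '"\\$`':
                # Simplification: backslash escapes only these characters here.
                i += 1
                current.append(line[i])
            else:
                current.append(ch)
        i += 1
    if state != "plain":
        raise ValueError("unterminated quote")
    if in_word:
        words.append("".join(current))
    return words


print(split_words('cd "My Documents"'))     # ['cd', 'My Documents']
print(split_words("echo a\\ b 'c d' \"\""))  # ['echo', 'a b', 'c d', '']
```

Note that the loop has no way to handle anything nested inside the double quotes, which is where it stops being enough.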

But that is not a good approach for a real shell. Even inside double quotes you can have a fully recursive program, like:

    $ echo "hi ${v1:-A${v2:-X${v3}Y}B}"
    hi AXYB

Once you have recursion then you need some kind of parser, not just a lexer.
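As an illustration, here is a minimal recursive-descent sketch in Python that handles only `${name}` and `${name:-default}`, where the default may itself contain more expansions. The function names and the dict-based `env` are assumptions for the example. Unlike a real shell, it always expands the default even when the variable is set, which is harmless here because nothing has side effects.

```python
def expand(text, env):
    """Expand ${name} and ${name:-default} in text, recursively."""
    result, _ = _expand_until(text, 0, env, stop=None)
    return result


def _expand_until(text, pos, env, stop):
    """Expand from pos until the stop character (or end of text if stop is None)."""
    out = []
    while pos < len(text):
        ch = text[pos]
        if ch == stop:
            return "".join(out), pos + 1
        if text.startswith("${", pos):
            value, pos = _expand_braced(text, pos + 2, env)
            out.append(value)
        else:
            out.append(ch)
            pos += 1
    if stop is not None:
        raise ValueError("missing closing brace")
    return "".join(out), pos


def _expand_braced(text, pos, env):
    """Parse what follows '${' and return (value, position after the closing '}')."""
    start = pos
    while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
        pos += 1
    name = text[start:pos]
    if not name:
        raise ValueError("expected a variable name")
    if text.startswith(":-", pos):
        # The default is itself a word, so recurse to expand it.
        default, pos = _expand_until(text, pos + 2, env, stop="}")
        value = env.get(name, "")
        return (value if value else default), pos
    if pos < len(text) and text[pos] == "}":
        return env.get(name, ""), pos + 1
    raise ValueError("expected '}' or ':-'")


print(expand("hi ${v1:-A${v2:-X${v3}Y}B}", {}))           # hi AXYB
print(expand("hi ${v1:-A${v2:-X${v3}Y}B}", {"v2": "Z"}))  # hi AZB
```

The mutual recursion between `_expand_until` and `_expand_braced` is the part a flat character loop cannot express: each nested `${` needs its own matching `}`.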

I've been mocked on HN for saying this before but Bash and other shells of its ilk are programming languages in their own right. I mean sure you're dependent on the suite of tools in $PATH to do anything useful, but that's not that much different to the standard libraries that make modern languages so powerful.

I have a hard time seeing what there is to mock about your opinion of shells. I absolutely consider them languages - better at some things, worse at others.