by arendtio 2700 days ago
In fact, I don't like people optimizing shell scripts for performance. I mean, shell scripts are slow by design, and if you need something fast, you chose the wrong technology in the first place.

Instead, shell scripts should be optimized for readability and portability, and I think it is much easier to understand something like 'read | change >write' than 'change <read >write'. So I like to write pipelines like this:

  cat foo.txt \
    | grep '^x' \
    | sed 's/a/b/g' \
    | awk '{print $2}' \
    | wc -l >bar.txt
It might not be the most efficient processing method, but I think it is quite readable.

For those who disagree with me: You might find the pure-bash-bible [1] valuable. While I admire their passion for shell scripts, I think they are optimizing for the wrong goal. I would be more of a fan of something along the lines of 'readable-POSIX-shell-bible' ;-)

[1]: https://github.com/dylanaraps/pure-bash-bible

10 comments

IMHO, shell scripts are a minefield and if you want something readable and portable, this is also the wrong technology. They are convenient though. They are like the Excel macros of the UNIX world.

Now back to the topic of "cat", which is a great example of why shell scripts are minefields.

Replace "foo.txt" with a user supplied variable, let's call it "$F". It becomes cat $F | blah_blah... I mean cat "$F" | blah_blah, first trap, but everyone knows that.

Now, if F='-n', second trap. What you think is a file will be considered an option and cat will wait for user input, like when no file is given. Ok, so you need to do cat -- "$F" | blah_blah.

That should be OK in every case now, but remember that "cat" is just another executable, or maybe a builtin. For some reason, on your system "cat --" may not work, or some asshat may have added "." in your PATH and you may be in a directory with a file named "cat". Or maybe some alias that decides to add color.

There are other things to consider, like your locale, which may mess up your output with commas instead of decimal points, or with Unicode characters. For that reason, you need to be very careful every time you call a command, and even more so if you pipe the output.

For that reason, I avoid using "cat" in scripts. It is an extra command call and all the associated headaches I can do without.
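
The option-parsing trap described above can be sketched in a few lines. This is only an illustration, assuming a POSIX-like shell with `mktemp` available; the function name `count_x_lines` and the scratch directory are made up for the example.

```shell
# Hypothetical helper: count lines starting with "x" in a user-supplied file.
# With <"$1" the shell opens the file itself, so the name is never parsed
# as an option and no extra cat process is started.
count_x_lines() {
    grep -c '^x' <"$1"
}

# Scratch directory containing a file literally named "-n" (assumption:
# creating a temporary directory is acceptable for the demo).
demo_dir=$(mktemp -d)
printf 'xa\nyb\nxc\n' >"$demo_dir/-n"

F='-n'
# "--" ends option parsing, so cat treats -n as a file name.
with_cat=$(cd "$demo_dir" && cat -- "$F" | grep -c '^x')
# Same result through the redirection-based helper.
without_cat=$(cd "$demo_dir" && count_x_lines "$F")

echo "$with_cat $without_cat"
rm -rf "$demo_dir"
```

Both variants should report the same count; the redirection form simply has fewer places where the file name can be misread.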

> Now, if F='-n', second trap

You're not wrong, but I think it's worth pointing out that's a trap that comes up any time you exec another program, whether it's from shell or python. I can't reasonably expect `subprocess.run(["cat", somevar])` to work if `somevar = "-n"`.

(Now, obviously, I'm not going to "cat" from python, but I might "kubectl" or something else that requires care around the arguments)

> Replace "foo.txt" with a user supplied variable, let's call it "$F". It becomes cat $F | blah_blah... I mean cat "$F" | blah_blah, first trap, but everyone knows that.

I think that you forgot to edit the "I mean" to "echo $F" :)

I agree with the sentiment, but my critique applies so generally that it must be noted: if a command accepts a filename as a parameter, you should absolutely pass it as a parameter rather than `cat` it over stdin.

For example, you can write this pipeline as:

    grep '^x' foo.txt \
        | sed 's/a/b/g' \
        | awk '{print $2}' \
        | wc -l > bar.txt
This is by no means scientific, but I've got a LaTeX document open right now. A quick `time` says:

    $ time grep 'what' AoC.tex
    real    0m0.045s
    user    0m0.000s
    sys     0m0.000s

    $ time cat AoC.tex | grep what
    real    0m0.092s
    user    0m0.000s
    sys     0m0.047s
Anecdotally, I've witnessed small pipelines that absolutely make sense totally thrash a system because of inappropriate uses of `cat`. When you `cat` a file, the OS must (1) `fork` and `exec`, (2) copy the file to `cat`'s memory, (3) copy the contents of `cat`'s memory to the pipe, and (4) copy the contents of the pipe to `grep`'s memory. That's a whole lot of copying for large files -- especially when the first command in the sequence (here, grep) usually performs some major kind of reduction on the input data!
In my opinion, it's perfectly fine either way unless you're worried about performance. I personally tend to try to use the more performant option when there's a choice, but a lot of times it just doesn't matter.

That said, I suspect the example would be much faster if you didn't use the pipeline, because a single tool could do it all (I'm leaving in the substitution and column print that are actually unused in the result):

    awk '/^x/{gsub("a","b"); print $2; count++} END{print count+0}' foo.txt
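
A quick way to check a single-awk replacement against the original pipeline is to run both on a small sample. The sketch below assumes a POSIX-like shell with `mktemp`; the sample data and variable names are made up. Note that in awk `NR` is the number of records read so far, so counting only the matching lines takes an explicit counter.

```shell
# Made-up sample: two lines start with "x", one does not.
sample_file=$(mktemp)
printf 'x1 aa\ny2 ab\nx3 ca\n' >"$sample_file"

# Original pipeline. tr strips the padding some wc implementations add.
pipeline_count=$(cat "$sample_file" | grep '^x' | sed 's/a/b/g' \
    | awk '{print $2}' | wc -l | tr -d ' ')

# Single awk: bump a counter per matching line and print it at the end.
# "count + 0" prints 0 rather than an empty line when nothing matches.
awk_count=$(awk '/^x/ { count++ } END { print count + 0 }' "$sample_file")

echo "pipeline: $pipeline_count, awk: $awk_count"
rm -f "$sample_file"
```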
That syntax is very different from anything I've seen. I am also a fan of splitting pipelines with line breaks for readability; however, I put the pipe at the end of each line and omit the backslash. In Bash, a line that ends with a pipe always continues on the next line.

In any case, it's probably just a matter of personal taste.
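
For comparison, the trailing-pipe style mentioned above might look like the following. It is a sketch with made-up inline input instead of foo.txt, and the variable name `matched` is only for the demo.

```shell
# A line ending in "|" is an incomplete command, so the shell keeps
# reading the next line; no backslash is needed.
matched=$(
    printf 'xa 1\nyb 2\nxc 3\n' |
        grep '^x' |
        sed 's/a/b/g' |
        awk '{print $2}' |
        wc -l
)
echo "$matched"
```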

That's actually very readable. I'm now regretting that I hadn't seen this about 3 months ago--I recently left a project that had a large number of shell scripts I had written or maintained for my team. This probably would've made it much easier for the rest of the team to figure out what the command was doing.
If the order is your concern, you can also put the <read at the beginning of the line. <file grep x works the same as: cat file | grep x
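
A small sketch of the leading-redirection form, assuming a POSIX-like shell with `mktemp`; the file contents and variable names are invented for the example.

```shell
# Made-up input file for the demo.
words_file=$(mktemp)
printf 'xa\nyb\nxc\n' >"$words_file"

# A redirection may appear anywhere in a simple command, including first,
# so the line still reads source-then-filter without a cat process.
leading=$(<"$words_file" grep -c '^x')

# The cat form produces the same result with one extra process.
with_cat=$(cat "$words_file" | grep -c '^x')

echo "$leading $with_cat"
rm -f "$words_file"
```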
I've been using unix for 25 years and I did not know that.
I dunno, you are bringing 5 cores to bear and there is no global interpreter lock, which is not a bad start.
I like 'collection pipeline' code written in this style regardless of language. If we took away the pipe symbols (or the dots) and just used indentation we'd have something that looked like asm but with flow between steps rather than common global state.

I periodically think it would be a good idea to organize a language around this.

awk can do all of that except sed. And I am not sure about the last. No need for wc (NR in AWK, if I recall correctly), no need for grep, you have the /match/ pattern, with regex too.
> except sed

Doesn't gsub(/a/, "b") do the same thing as s/a/b/g?

Yes, I remembered it hours ago.
I find something like this:

   grep '^x' < input | sed 's/foo/bar/g' 
to be very readable, as the flow is still visually apparent based on punctuation.
I don't like this style at all. If you're following the pipeline, it starts in the middle with "input", goes to the left for the grep, then to the right (skipping over the middle part) to sed.

     cat input | grep '^x' | sed 's/foo/bar/g'
Is far more readable, in my opinion. In addition, it makes it trivial to change the input from a file to any kind of process.

I'm STRONGLY in favor of using "cat" for input. That "useless use of cat" article is pretty dumb, IMHO.

Note that '<input grep | foo' is also valid.
In this particular example, ‘unnecessary use of cat’ is accompanied by ‘unnecessary use of grep’.

    cat input | grep '^x' | sed 's/foo/bar/g'

    sed '/^x/s/foo/bar/g' <input
That's not the same thing. The sed output will still keep lines not starting with x (just not replacing foo with bar in those) where grep will filter those out.
Yeah, Muphry's law at work. Corrected version:

   sed -n '/^x/{s/foo/bar/g;p;}' <input
This may be an inadvertent argument for the ‘connect simpler tools’ philosophy.
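
One way to check a sed one-liner against the grep | sed pipeline is to run both on input that includes a line starting with x but lacking foo. The sketch below assumes a POSIX-like shell with `mktemp`; the sample data and variable names are made up. The subtlety it illustrates: `p` as a flag of `s` prints only lines where a substitution actually happened, whereas `p` as a separate command prints every line the address selects.

```shell
# Made-up input: "x plain" starts with x but contains no "foo".
input_file=$(mktemp)
printf 'x foo\nx plain\ny foo\n' >"$input_file"

# Reference behavior: filter first, then substitute.
expected=$(grep '^x' "$input_file" | sed 's/foo/bar/g')

# "p" as a flag of "s": the line "x plain" is dropped.
flag_only=$(sed -n '/^x/s/foo/bar/gp' "$input_file")

# "p" as its own command: every line matching /^x/ is printed.
explicit_p=$(sed -n '/^x/{s/foo/bar/g;p;}' "$input_file")

[ "$explicit_p" = "$expected" ] && echo "explicit p matches the pipeline"
[ "$flag_only" = "$expected" ] || echo "p flag alone does not"
rm -f "$input_file"
```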
You can just remove the <
You can if input is a file. It might be a program with no arguments or something else.
In your original command, how can 'input' be a program with no arguments?
Oh, damn. You're exactly right.

OK, to save some of my face, this will work:

    grep 'foo' <(input) | sed 's/baz/bar/g'
... at least in zsh and probably bash.
I don’t like that at all. That creates a subshell and is also less readable than

    input | grep foo | sed ...
This is a very silly way of writing it though. grep|sed can almost always be replaced with a simple awk: awk '/^x/ { sub("a", "b"); print $2; }' foo.txt. This way, the whole command fits on one line. If it doesn't, put your awk script in a separate file and simply call it with "awk -f myawkscript foo.txt".
I would disagree that their way of writing it is silly.

It is instantly plainly obvious to me what each step of their shell script is doing.

While I can absolutely understand what your shell script does after parsing it, its meaning doesn't leap out at me in the same way.

I would describe the prior shell script as more quickly readable than the one that you've listed.

So, perhaps it's not a question of one being more silly than the other—perhaps the author just has different priorities from you?

I use awk in exactly this way personally, but, awk is not as commonly readable as grep and sed (in fact, that use of grep and sed should be pretty comprehensible to someone who just knows regular expressions from some programming languages and very briefly glances at the manpages, whereas it would be difficult to learn what that awk syntax means just from e.g. the GNU awk manpage). So, just as you could write a Perl one-liner but you shouldn't if you want other people to read the code, I'd probably advise against the awk one-liner too.
Not sure why you say grep and sed are more readable than awk! (Not sure what 'commonly readable' means.) Or even that that particular line in awk is harder to understand than the grep and sed man pages. The awk manpage even has examples, including print $2. The sed manpages must be the most impenetrable manpages known to 'man', if you don't already understand sed. (People might already know s///g because 99% of the time, that's all sed is used for.)
>sub("a", "b");

That should be gsub, shouldn't it? (sub only replaces the first occurrence)

Yes.
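
The difference is easy to see on a word with several matches. A minimal sketch, with a made-up input string and variable names:

```shell
# sub replaces only the first match on each line.
first_only=$(echo 'banana' | awk '{ sub("a", "b"); print }')

# gsub replaces every match, like the g flag of sed's s command.
all_matches=$(echo 'banana' | awk '{ gsub("a", "b"); print }')

echo "$first_only $all_matches"   # prints "bbnana bbnbnb"
```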