There is general wisdom about bash pipelines here that I think most people will miss simply because of the title. Interestingly, though, my mental model of bash piping was wrong too.
There were several reasons why pipes were added to Unix, and the ability to run producer/consumer processes concurrently was one of them. Before that (and for many years after on non-Unix systems) the most prevalent paradigm was indeed to run multi-stage pipelines with the moral equivalent of the following:
stage1.exe /in:input.dat /out:stage1.dat
stage2.exe /in:stage1.dat /out:stage2.dat
del stage1.dat
stage3.exe /in:stage2.dat /out:result.dat
del stage2.dat
Pipes are so useful. I find myself more and more using shell script and pipes for complex multi-stage tasks. This also simplifies any non-shell code I must write, as there are already high quality, performant implementations of hashing and compression algorithms I can just pipe to.
My biggest annoyance is when I get some tooling from some other team, and they're like "oh just extend this Python script". It'll operate on local files, using shell commands, in a non-reentrant way, with only customization from commenting out code. Maybe there's some argparse but you end up writing a program using their giant args as primitives.
Guys just write small programs and chain them. The wisdom of the ancients is continuously lost.
Python comes with a standard library module called fileinput that makes this very easy. It iterates over the files named in sys.argv[1:], and reads from stdin if that list is empty or when a name is a dash.
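A minimal sketch of that pattern, assuming a hypothetical line-counting tool (the function name `count_lines` is made up for illustration):

```python
import fileinput


def count_lines(files=None):
    """Count lines in the named files, or in stdin when none are given."""
    total = 0
    # files=None makes fileinput fall back to sys.argv[1:]; an empty list
    # or a "-" entry means "read standard input".
    with fileinput.input(files=files) as stream:
        for _line in stream:
            total += 1
    return total
```

Called as `print(count_lines())` at the bottom of a script, this should work both as `python3 count.py a.log b.log` and as `cat a.log | python3 count.py` (the script and log names are placeholders).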
It's just a preference thing. I loathe the small-program chaining style and cannot work with it at all. Give me a Python script and I'm good, though. I can't for the life of me imagine why people would want to do pseudo programming through piping magic when chaining is so limited compared to actual programming.
This is of course a false dichotomy, there's nothing pseudo about using bash (perhaps you mean sudo?) and bash scripts orchestrate what you call 'actual' programs.
I commonly write little python scripts to filter logs, which I have read from stdin. That means I can filter a log to stdout:
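A minimal sketch of such a filter might look like the following. The default pattern `ERROR` and the function names are illustrative assumptions, not the commenter's actual script.

```python
import re
import sys


def filter_lines(lines, pattern):
    """Yield only the lines that match the regular expression."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


def main(argv, stdin=None, stdout=None):
    """Copy matching lines from stdin to stdout, one line at a time."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    # Hypothetical default: look for "ERROR" when no pattern is given.
    pattern = argv[1] if len(argv) > 1 else "ERROR"
    for line in filter_lines(stdin, pattern):
        stdout.write(line)
    return 0
```

With `sys.exit(main(sys.argv))` added at the bottom, something like `python3 filter_log.py WARN < app.log | less` would then slot into a pipeline like any other tool (the file names are placeholders).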
"The programmer scoffed at Master Foo and rose to depart. But Master Foo nodded to his student Nubi, who wrote a line of shell script on a nearby whiteboard, and said: “Master programmer, consider this pipeline. Implemented in pure C, would it not span ten thousand lines?”"
Well they all do, but in terms of ease of use, tar and zip are much simpler to implement in a cli pipeline than to write bespoke code. At least that has been my experience.
It is hard to compete with "| gzip" in any programming language. Just importing a library and you're already well past that. Just typing "import" and you're tied! Overbudget if I drop the space in "| gzip".
This is one of the reasons why, for all its faults, shell just isn't going anywhere any time soon.
You can also (assuming your language supports it) execute gzip and, assuming your language gives you some writable handle to the pipe, write data into it. So you get the concurrency "for free", but you don't have to go all the way to "do all of it in process".
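In Python, for instance, that could be sketched roughly as below. It assumes a `gzip` binary is on the PATH, and the helper name `gzip_to_file` is made up for illustration.

```python
import subprocess


def gzip_to_file(chunks, out_path):
    """Stream byte chunks into an external gzip process writing to out_path."""
    with open(out_path, "wb") as out:
        # gzip runs concurrently with this process; an OS pipe connects the two.
        proc = subprocess.Popen(["gzip", "-c"], stdin=subprocess.PIPE, stdout=out)
        try:
            for chunk in chunks:
                proc.stdin.write(chunk)
        finally:
            proc.stdin.close()  # EOF on the pipe tells gzip to finish up
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError(f"gzip exited with status {returncode}")
```

The compression happens in the child process while the parent keeps producing data, which is the same concurrency a shell pipeline would give.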
I've also done the "trick" of executing [bash, -c, <stuff>] in a higher language, too. I'd personally rather see the work better suited for the high language done in the higher language, but if shell is easier, then as such it is.
It's sort of like unsafe blocks: minimize the shell to a reasonable portion, clearly define the inputs/outputs, and make sure you're not vulnerable to shell-isms, as best as you can, at the boundary.
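One way to keep that boundary tight, sketched here under the assumption that `bash` and `tr` are available, is to keep the script text fixed and hand untrusted values over as positional parameters instead of interpolating them into the script. The function name and the pipeline itself are only illustrative.

```python
import subprocess


def upper_via_shell(text):
    """Run a fixed shell pipeline, passing untrusted text as an argument."""
    # The script is a constant; it never has user data pasted into it.
    script = 'printf "%s\\n" "$1" | tr "a-z" "A-Z"'
    # "_" fills $0; text arrives as $1 and is never parsed as shell code.
    result = subprocess.run(
        ["bash", "-c", script, "_", text],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Because the value is only ever expanded inside double quotes as `"$1"`, input such as `$(echo x)` or `hi; echo pwned` is treated as plain data.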
But I still think I see the reverse far more often. Like, `steam` is … all the time, apparently … exec'ing a shell to then exec … xdg-user-dir? (And the error seems to indicate that that's it…) Which seems more like the sort of "you could just exec this yourself?". (But Steam is also mostly a web-app of sorts, so, for all I know there's JS under there, and I think node is one of those "makes exec(2) hard/impossible" langs.)
Sometimes you want the intermediate files as well, though. For example, if doing some kind of exploratory analysis of the different output stages of the pipeline, or even just for debugging.
Tee can be useful for that. Maybe pv (pipe viewer) too. I have not tried it yet.
SPONGE(1)                          moreutils                         SPONGE(1)

NAME
       sponge - soak up standard input and write to a file

SYNOPSIS
       sed '...' file | grep '...' | sponge [-a] file

DESCRIPTION
       sponge reads standard input and writes it out to the specified file.
       Unlike a shell redirect, sponge soaks up all its input before writing
       the output file. This allows constructing pipelines that read from and
       write to the same file.
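The idea can be sketched in a few lines of Python. This is only an illustration of "soak up everything first, then write", not a reimplementation of the moreutils tool, and both function names are made up.

```python
def sponge(chunks, path):
    """Consume every chunk before opening path for writing."""
    buffered = b"".join(chunks)  # drains the producer completely
    with open(path, "wb") as out:  # only now is the target truncated
        out.write(buffered)


def grep_file(path, needle):
    """Lazily yield the lines of path that contain needle."""
    with open(path, "rb") as src:
        for line in src:
            if needle in line:
                yield line
```

`sponge(grep_file("notes.txt", b"keep"), "notes.txt")` filters a file in place (the file name is a placeholder). Opening the output for writing before the reader has finished, which is what a plain shell redirect does, would truncate the input first.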
Usually mental models develop "organically" from when one was a n00b, without much thought, and sometimes it can take a long time for them to be unseated, even though it's kind of obvious in hindsight that the mental model is wrong (e.g. one can see that from "slow-program | less", and things like that).
I think a main reason for this is that you can have a "good enough" working mental model of a process, that holds up to your typical use cases and even moderate scrutiny. It's often only once you run into a case where your mental model fails that you even think to challenge the assumptions it was built on - at least, this has been my experience.
Can’t speak for OP, but one might reasonably expect later stages to only start execution once at least some data is available—rather than immediately, before any data is available for them to consume.
Of course, there are many reasons you wouldn’t want this—processes can take time to start up, for example—but it’s not an unreasonable mental model.
Well, it could be implemented like this, it's just more cumbersome than "create N-1 anonymous pipes, fork N processes, wait for the last process to finish": at the very least you'll need to select() on the last unattached pipe, and when job control comes into the picture, you'd really like the "setting up the pipeline" and "monitoring the pipeline's execution" parts to be disentangled.
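The simple "start everything up front" scheme can be approximated with Python's subprocess module, as a rough sketch. It uses Popen rather than raw pipe() and fork(), it adds one extra pipe to capture the last stage's output, and the name `run_pipeline` is hypothetical.

```python
import subprocess


def run_pipeline(commands):
    """Start every stage immediately, joined by pipes; return output and exit codes."""
    procs = []
    upstream = subprocess.DEVNULL  # the first stage gets no input
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if procs:
            # Drop the parent's copy of the read end, so the upstream stage
            # sees a broken pipe if this stage exits early.
            procs[-1].stdout.close()
        procs.append(proc)
        upstream = proc.stdout
    # All stages are already running; only now do we wait, starting with
    # the last one, whose output we collect.
    output, _ = procs[-1].communicate()
    return output, [proc.wait() for proc in procs]
```

For example, `run_pipeline([["ls"], ["sort", "-r"]])` starts both programs before either has produced any data, which matches what the shell does for `ls | sort -r`.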
Not even that they might be particularly slow to start in absolute terms, but just that they might be slow relative to how fast the previous stage starts cranking out some input for it.
I have known this about Unix pipes for a very long time. It is always mentioned whenever they are introduced, but I guess people can miss it.
Though now I will break your mind as my mind was broken not a long time ago. Powershell, which is often said to be a better shell, works like that. It doesn't run things in parallel. I think the same is to be said about Windows cmd/batch, but don't cite me on that. That one thing makes Powershell insufficient to ever be a full replacement of a proper shell.
Not exactly. Non-native PowerShell pipelines are executed in a single thread, but the steps are interleaved, not buffered. That is, each object is passed through the whole pipeline before the next object is processed. This is non-ideal for high-performance data processing (e.g. `cat`ing a 10GB file, searching through it and gzipping the output), but for 99% of daily commands, it does not make any difference.
cmd.exe uses standard OS pipes and behaves the same as UNIX shells, same as Powershell invoking native binaries.
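The interleaved, single-threaded behaviour described for non-native PowerShell pipelines resembles chained Python generators. The sketch below is only an analogy under that assumption, not a description of how PowerShell is implemented, and all names in it are made up.

```python
def numbers(log):
    """First stage: produce three items, recording when each is made."""
    for n in range(3):
        log.append(f"produce {n}")
        yield n


def double(items, log):
    """Second stage: handle each item as soon as the first stage yields it."""
    for item in items:
        log.append(f"double {item}")
        yield item * 2


def run(log):
    """Drive the two-stage pipeline on a single thread."""
    return list(double(numbers(log), log))
```

After `run(log)`, the log alternates between "produce" and "double" entries: each item travels through the whole chain before the next one is created, and nothing runs in parallel.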
Oh, that's what I missed! I managed to find out about it while trying to do an equivalent of `curl ... | tar xzf -` in Powershell. I was stumped. I guess the thing is that a Unix shell would do a subshell automatically.
> Though now I will break your mind as my mind was broken not a long time ago. Powershell, which is often said to be a better shell, works like that. It doesn't run things in parallel. I think the same is to be said about Windows cmd/batch, but don't cite me on that. That one thing makes Powershell insufficient to ever be a full replacement of a proper shell.
A pipeline in PowerShell is definitely streaming unless you accidentally force the output into a list/array at some point, e.g. try this for yourself (somewhere you can interrupt the script obviously as it's going to run forever)
Whether it runs in parallel depends on the implementation of each side. Interpreted PowerShell code does not run in parallel unless you run it as a job, use ForEach-Object -Parallel, or explicitly put it on another thread. But the data is not collected together before being sent from one step to the next.
The character "|" was introduced to computing in the language NPL at IBM in December 1964 as a notation for bitwise OR, replacing ".OR.", which IBM had used in its previous programming language, "FORTRAN IV" (OR was placed between dots to distinguish it from identifiers, marking it as an operator).
The next year the experimental NPL (New Programming Language) was rebranded as PL/I and became a commercial product of IBM.
Following PL/I, other programming languages began to use "&" and "|" for AND and OR, including the B language, the predecessor of C.
The pipe and its notation were introduced in the Third Edition of UNIX (based on a proposal made by M. D. McIlroy), in 1972, so after the language B had been used for a few years and before the development of C. The oldest documentation about pipes that I have seen is in "UNIX Programmer's Manual Third Edition" from February 1973.
Before NPL, the vertical bar had already been used in the Backus-Naur notation introduced in the report about ALGOL 60 as a separator between alternatives in the description of the grammar of the language, so with a meaning somewhat similar to OR.
That is right, but I did not want to provide too many details that did not belong to the topic.
FORTRAN IV did not have bit strings, it had only Boolean values ("LOGICAL").
Therefore all the logical operators could be applied only to Boolean operands, giving a Boolean result.
The same was true for all earlier high-level programming languages.
The language NPL, renamed PL/I in 1965, was the first high-level programming language to introduce bit string values, so the AND, OR and NOT operators could operate on bit strings, not only on single Boolean values.
If PL/I had remained restricted to the smaller character set accepted by FORTRAN IV in source texts, it would have retained the FORTRAN IV operators ".NOT.", ".AND.", ".OR.", extending their meaning to bit string operators.
However, IBM decided to extend the character set, which allowed the use of dedicated symbols for the logical operators and also for other operators that previously had to use keywords, like the relational operators, and also for new operators introduced by PL/I, like the concatenation operator.
Not to get all semiotic about it, but |x| notation is a pair of vertical lines. I'm sure that someone has written a calculator program where two 0x7C characters bracketing a symbol mean absolute value, but if I've ever seen it, I can't recall.
Although 0x7C is overly specific, since if a sibling comment is correct (I have no reason to think otherwise), | for bitwise OR originates in PL/1, where it would have been encoded in EBCDIC, which codes it as 0x4F.
I'm not really disagreeing with you, the |abs| notation is quite a bit older than computers, just musing on what should count as the first use of "|". I'm inclined to say that it should go to the first use of an encoding of "|", not to the similarly-appearing pen and paper notation, and definitely not the first use of ASCII "|" aka 0x7C in a programming language. But I don't think there's a right answer here, it's a matter of taste.
Because one could argue back to the Roman numeral I, if one were determined to do so: when written sans serif, it's just a vertical line, after all. Somehow, abs notation and "first use of an encoded vertical bar" both seem reasonable, while the Roman numeral and specifically-ASCII don't, but I doubt I can unpack that intuition in any detail.
The language APL\360 of IBM (August 1968) and the other APL dialects that followed it used a single "|" as a monadic prefix operator that computes the absolute value and also as a dyadic infix operator that computes the remainder of the division (but with the operand order reversed in comparison with the language C, which is usually much more convenient, especially in APL, where this order avoids the need for parentheses in most cases).