| HN Mirror



	by aforty 861 days ago
	There is general wisdom about bash pipelines here that I think most people will miss simply because of the title. Interesting though, my mental model of bash piping was wrong too.

3 comments

Joker_vD 861 days ago

There were several reasons why pipes were added to Unix, and the ability to run producer/consumer processes concurrently was one of them. Before that (and for many years after on non-Unix systems) the most prevalent paradigm was indeed to run multi-stage pipelines with the moral equivalent of the following:

    stage1.exe /in:input.dat /out:stage1.dat
    stage2.exe /in:stage1.dat /out:stage2.dat
    del stage1.dat
    stage3.exe /in:stage2.dat /out:result.dat
    del stage2.dat

jakogut 861 days ago

Pipes are so useful. I find myself more and more using shell script and pipes for complex multi-stage tasks. This also simplifies any non-shell code I must write, as there are already high quality, performant implementations of hashing and compression algorithms I can just pipe to.

jvanderbot 861 days ago

My biggest annoyance is when I get some tooling from some other team, and they're like "oh just extend this Python script". It'll operate on local files, using shell commands, in a non-reentrant way, with only customization from commenting out code. Maybe there's some argparse but you end up writing a program using their giant args as primitives.

Guys just write small programs and chain them. The wisdom of the ancients is continuously lost.

hiccuphippo 861 days ago

Python comes with a built-in module called fileinput that makes this very easy. It iterates over the lines of the files named in sys.argv[1:], and reads from stdin if that list is empty or when a name is a dash.

https://docs.python.org/3/library/fileinput.html
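A minimal sketch of the fileinput approach, assuming a hypothetical helper named `numbered_lines` (not part of the module). With `files=None`, `fileinput.input` falls back to `sys.argv[1:]` and then to stdin, which is what lets the same script sit in a pipeline or take file names:

```python
import fileinput


def numbered_lines(files=None):
    """Yield (filename, line number within that file, text) for each input line.

    files=None lets fileinput use sys.argv[1:], or stdin when no names are given.
    """
    with fileinput.input(files=files) as stream:
        for line in stream:
            yield stream.filename(), stream.filelineno(), line.rstrip("\n")
```

A script built on this could loop over `numbered_lines()` and print whatever it needs, so both `python script.py a.log b.log` and `cat a.log | python script.py` would work.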

anonymous-panda 861 days ago

I would recommend the python sh module instead of writing bash for more complex code. Python’s devenv and tooling is way more mature and safer.

dan_mctree 860 days ago

It's just a preference thing, I loathe the small program chaining style and cannot work with it at all. Give me a python script and I'm good though. I can't for the life of me imagine why people would want to do pseudo programming through piping magic when chaining is so limited compared to actual programming

samatman 860 days ago

This is of course a false dichotomy, there's nothing pseudo about using bash (perhaps you mean sudo?) and bash scripts orchestrate what you call 'actual' programs.

I commonly write little python scripts to filter logs, which I have read from stdin. That means I can filter a log to stdout:

   cat logfile.log | python parse_logs.py

Or filter them as they're generated:

   tail -f logfile.log | python parse_logs.py

Or write the filtered output to a file:

   cat logfile.log | python parse_logs.py > filtered.log

Or both:

   tail -f logfile.log | python parse_logs.py | tee filtered.log

It would be possible, I suppose, to configure a single python script to do all those things, with flags or whatever.

But who on Earth has the time for that?
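For illustration, a sketch of what such a stdin filter might look like. The name `filter_stream` and the `"ERROR"` keyword are assumptions, not the actual parse_logs.py. Flushing after each match is what keeps the `tail -f` case responsive when stdout is a pipe:

```python
import sys


def filter_stream(src, dst, keyword="ERROR"):
    """Copy lines containing keyword from src to dst; return the match count."""
    matched = 0
    for line in src:
        if keyword in line:
            dst.write(line)
            dst.flush()  # push each match downstream right away
            matched += 1
    return matched


def main():
    # Entry point a script could call: read stdin, write stdout.
    filter_stream(sys.stdin, sys.stdout)
```

Because the function only sees file-like objects, the shell decides where the input comes from and where the output goes.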

dgfitz 860 days ago

Chaining pipes in python is quite obnoxious.

number6 861 days ago

"The programmer scoffed at Master Foo and rose to depart. But Master Foo nodded to his student Nubi, who wrote a line of shell script on a nearby whiteboard, and said: “Master programmer, consider this pipeline. Implemented in pure C, would it not span ten thousand lines?”"

http://catb.org/~esr/writings/unix-koans/ten-thousand.html

keybored 861 days ago

Ugh. I don’t feel that the spirit of those satirical Zen Koans is to be so self-congratulatory.

capitol_ 861 days ago

What programming language do you use where there aren't performant hashing/compression algorithms implemented as libraries?

jvanderbot 861 days ago

Well they all do, but in terms of ease of use, tar and zip are much simpler to implement in a cli pipeline than to write bespoke code. At least that has been my experience.

jerf 861 days ago

It is hard to compete with "| gzip" in any programming language. Just importing a library and you're already well past that. Just typing "import" and you're tied! Overbudget if I drop the space in "| gzip".

This is one of the reasons why, for all its faults, shell just isn't going anywhere any time soon.

deathanatos 861 days ago

It is hard to compete with.

You can also (assuming your language supports it), execute gzip, and assuming your language gives you some writable-handle to the pipe, then write data into it. So, you get the concurrency "for free", but you don't have to go all the way to "do all of it in process".

I've also done the "trick" of executing [bash, -c, <stuff>] in a higher language, too. I'd personally rather see the work better suited for the high language done in the higher language, but if shell is easier, then as such it is.

It's sort of like unsafe blocks: minimize the shell to a reasonable portion, clearly define the inputs/outputs, and make sure you're not vulnerable to shell-isms, as best as you can, at the boundary.
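A rough sketch of the "exec gzip and write into its pipe" idea in Python, assuming a `gzip` binary is on PATH. The helper name `gzip_stream` is made up for this example. The command is passed as an argument list, so no shell parses it:

```python
import subprocess


def gzip_stream(chunks, out_path, argv=("gzip", "-c")):
    """Stream byte chunks through an external gzip process into out_path."""
    with open(out_path, "wb") as out:
        # The child compresses concurrently while we keep producing chunks.
        proc = subprocess.Popen(list(argv), stdin=subprocess.PIPE, stdout=out)
        try:
            for chunk in chunks:
                proc.stdin.write(chunk)
        finally:
            proc.stdin.close()  # EOF tells gzip to finish
        returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, list(argv))
```

Something like `gzip_stream(iter_of_bytes, "out.gz")` keeps the inputs and outputs of the external step explicit, in the spirit of the boundary described above.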

But I still think I see the reverse far more often. Like, `steam` is … all the time, apparently … exec'ing a shell to then exec … xdg-user-dir? (And the error seems to indicate that that's it…) Which seems more like the sort of "you could just exec this yourself?". (But Steam is also mostly a web-app of sorts, so, for all I know there's JS under there, and I think node is one of those "makes exec(2) hard/impossible" langs.)

fuzztester 861 days ago

Sometimes you want the intermediate files as well, though. For example, if doing some kind of exploratory analysis of the different output stages of the pipeline, or even just for debugging.

Tee can be useful for that. Maybe pv (pipe viewer) too. I have not tried it yet.

alas44 861 days ago

We are two!

adql 861 days ago

...how? It's called a pipe, not an "infinitely large buffer that will wait indefinitely till the command ends to pass its output further".

m000 861 days ago

That is called a sponge!

  SPONGE(1)                          moreutils                         SPONGE(1)

  NAME
         sponge - soak up standard input and write to a file

  SYNOPSIS
         sed '...' file | grep '...' | sponge [-a] file

  DESCRIPTION
         sponge reads standard input and writes it out to the specified file.
         Unlike a shell redirect, sponge soaks up all its input before writing
         the output file. This allows constructing pipelines that read from and
         write to the same file.
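The soak-then-write idea can be sketched in a few lines of Python. This is only an illustration of the behaviour the man page describes, not how moreutils implements it, and the function name and signature are assumptions:

```python
def sponge(lines, path, append=False):
    """Drain every input line first, then write them to path."""
    soaked = list(lines)  # nothing is written until the input is exhausted
    mode = "a" if append else "w"
    with open(path, mode) as out:
        out.writelines(soaked)
    return len(soaked)
```

Because the output file is only opened after the input has been fully read, the input may be a lazy reader over that very file.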

arp242 861 days ago

Usually mental models develop "organically" from when one was a n00b, without much thought, and sometimes it can take a long time for them to be unseated, even though it's kind of obvious in hindsight that the mental model is wrong (e.g. one can see that from "slow-program | less", and things like that).

maicro 861 days ago

I think a main reason for this is that you can have a "good enough" working mental model of a process, that holds up to your typical use cases and even moderate scrutiny. It's often only once you run into a case where your mental model fails that you even think to challenge the assumptions it was built on - at least, this has been my experience.

zamfi 861 days ago

Can’t speak for OP, but one might reasonably expect later stages to only start execution once at least some data is available—rather than immediately, before any data is available for them to consume.

Of course, there are many reasons you wouldn’t want this—processes can take time to start up, for example—but it’s not an unreasonable mental model.

Joker_vD 861 days ago

Well, it could be implemented like this, it's just more cumbersome than "create N-1 anonymous pipes, fork N processes, wait for the last process to finish": at the very least you'll need to select() on the last unattached pipe, and when job control comes into the picture, you'd really like the "setting up the pipeline" and "monitoring the pipeline's execution" parts to be disentangled.
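The simple scheme (start every stage up front, connect neighbours with pipes, then wait) can be sketched with Python's subprocess module. The helper `run_pipeline` is hypothetical. Unlike a shell, it captures the last stage's output through one extra pipe and waits for every stage, which makes it easier to inspect:

```python
import subprocess


def run_pipeline(commands):
    """Start all stages immediately, joined by pipes; return (output, exit codes)."""
    if not commands:
        raise ValueError("need at least one command")
    procs = []
    upstream = subprocess.DEVNULL  # the first stage gets an empty stdin
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if procs:
            # Drop the parent's copy of the read end; only the new child keeps it.
            procs[-1].stdout.close()
        upstream = proc.stdout
        procs.append(proc)
    output = procs[-1].stdout.read()
    procs[-1].stdout.close()
    return output, [p.wait() for p in procs]
```

No stage waits for data before being launched. Each one simply blocks in its own read until the previous stage writes something.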

OJFord 861 days ago

Not even that they might be particularly slow to start in absolute terms, but just that they might be slow relative to how fast the previous stage starts cranking out some input for it.

(Since, as GP said, not an infinite buffer.)

hawski 861 days ago

I have known this about Unix pipes for a very long time. Whenever they are introduced it is always said, but I guess people can miss it.

Though now I will break your mind as my mind was broken not a long time ago. Powershell, which is often said to be a better shell, works like that. It doesn't run things in parallel. I think the same is to be said about Windows cmd/batch, but don't cite me on that. That one thing makes Powershell insufficient to ever be a full replacement of a proper shell.

MatejKafka 861 days ago

Not exactly. Non-native PowerShell pipelines are executed in a single thread, but the steps are interleaved, not buffered. That is, each object is passed through the whole pipeline before the next object is processed. This is non-ideal for high-performance data processing (e.g. `cat`ing a 10GB file, searching through it and gzipping the output), but for 99% of daily commands, it does not make any difference.

cmd.exe uses standard OS pipes and behaves the same as UNIX shells, same as Powershell invoking native binaries.
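As a loose analogy for the single-threaded, interleaved behaviour described above (this is plain Python generators, not PowerShell's actual implementation), each item travels through every stage before the next one is produced. All names here are made up for the sketch:

```python
def source(n, log):
    """Produce 0..n-1, recording when each item is created."""
    for i in range(n):
        log.append(f"produce {i}")
        yield i


def keep_even(items, log):
    """Pass through even items only, recording each one kept."""
    for x in items:
        if x % 2 == 0:
            log.append(f"keep {x}")
            yield x


def run(n):
    """Drive the two-stage pipeline and return the order of events."""
    log = []
    for x in keep_even(source(n, log), log):
        log.append(f"consume {x}")
    return log
```

For `run(3)` the log reads produce 0, keep 0, consume 0, produce 1, produce 2, keep 2, consume 2: nothing is buffered between stages, yet nothing runs in parallel either.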

hawski 861 days ago

Oh, that's what I missed! I managed to find out about it while trying to do an equivalent of `curl ... | tar xzf -` in Powershell. I was stumped. I guess the thing is that a Unix shell would do a subshell automatically.

poizan42 861 days ago

> Though now I will break your mind as my mind was broken not a long time ago. Powershell, which is often said to be a better shell, works like that. It doesn't run things in parallel. I think the same is to be said about Windows cmd/batch, but don't cite me on that. That one thing makes Powershell insufficient to ever be a full replacement of a proper shell.

A pipeline in PowerShell is definitely streaming unless you accidentally force the output into a list/array at some point. Try this for yourself (somewhere you can interrupt the script, obviously, as it's going to run forever):

    class InfiniteEnumerator : System.Collections.IEnumerator
    {
        hidden [ulong]$countMod2e64 = 0

        [object] get_Current()
        {
            return $this.countMod2e64
        }
        
        [bool] MoveNext() {
            $this.countMod2e64 += 1
            return $true
        }
        
        Reset() {
            $this.countMod2e64 = 0
        }

    }

    class InfiniteEnumerable : System.Collections.IEnumerable {
        InfiniteEnumerable() {}
        
        [System.Collections.IEnumerator] GetEnumerator() {
            return [InfiniteEnumerator]::new()
        }
    }

    [InfiniteEnumerable]::new() | ForEach-Object { Write-Host "Element number mod 2^64: $_" }

Whether it runs in parallel depends on the implementation of each side. Interpreted PowerShell code does not run in parallel unless you run it as a job, use ForEach-Object -Parallel, or explicitly put it on another thread. But the data is not collected together before being sent from one step to the next.

MatejKafka 861 days ago

More compact example (not to scare the POSIX people away :) ):

    0..1000000 | where {$_ % 10 -eq 0} | foreach {"Got Value: $_"}

poizan42 861 days ago

The streaming behavior of the range operator is weird though. This is tested on PowerShell 7.4.1

    > 0..1000000000 | % { $_ }
    # Starts printing out numbers immediately
    > 0..1000000000
    # Hangs longer than I had patience to wait for
    > $x=0..100
    > $x.GetType()
    # IsPublic IsSerial Name     BaseType
    # -------- -------- ----     --------
    # True     True     Object[] System.Array

It's an array when I save it in a variable, but it's obviously not an array on the LHS of a pipe.

shiomiru 861 days ago

DOS also has a "pipe", which works exactly like that. (Obviously, since DOS can't run multiple programs in parallel.)

lylejantzi3rd 861 days ago

Pipe, |, was also commonly used as an "OR" operator. I wonder if the idea that you could "pipe" data between commands came later.

adrian_b 861 days ago

The character "|" was introduced in computing by the language NPL at IBM in December 1964 as a notation for bitwise OR, replacing ".OR.", which had been used by IBM in its previous programming language, "FORTRAN IV" (OR was placed between dots to distinguish it from identifiers, marking it as an operator).

The next year the experimental NPL (New Programming Language) was rebranded as PL/I and became a commercial product of IBM.

Following PL/I, other programming languages began to use "&" and "|" for AND and OR, including the B language, the predecessor of C.

The pipe and its notation were introduced in the Third Edition of UNIX (based on a proposal made by M. D. McIlroy), in 1972, so after the language B had been used for a few years and before the development of C. The oldest documentation about pipes that I have seen is in "UNIX Programmer's Manual Third Edition" from February 1973.

Before NPL, the vertical bar had already been used in the Backus-Naur notation introduced in the report about ALGOL 60 as a separator between alternatives in the description of the grammar of the language, so with a meaning somewhat similar to OR.

hollerith 861 days ago

>as a notation for bitwise OR, replacing ".OR.", which had been used by IBM in its previous programming language, "FORTRAN IV".

Untrue: ".OR." in FORTRAN meant ordinary OR, not bitwise OR. I don't remember ever seeing bitwise OR or AND or XOR in FORTRAN IV.

adrian_b 861 days ago

That is right, but I did not want to provide too many details that did not belong to the topic.

FORTRAN IV did not have bit strings, it had only Boolean values ("LOGICAL").

Therefore all the logical operators could be applied only to Boolean operands, giving a Boolean result.

The same was true for all earlier high-level programming languages.

The language NPL, renamed PL/I in 1965, was the first high-level programming language to introduce bit string values, so the AND, OR and NOT operators could operate on bit strings, not only on single Boolean values.

Had PL/I remained restricted to the smaller character set accepted by FORTRAN IV in source texts, it would have retained the FORTRAN IV operators ".NOT.", ".AND.", ".OR.", extending their meaning to bit string operators.

However, IBM decided to extend the character set, which allowed the use of dedicated symbols for the logical operators and also for other operators that previously had to use keywords, like the relational operators, and also for new operators introduced by PL/I, like the concatenation operator.

hawski 861 days ago

I think the math usage was first. i.e. absolute value: |x|

samatman 860 days ago

Not to get all semiotic about it, but |x| notation is a pair of vertical lines. I'm sure that someone has written a calculator program where two 0x7C characters bracketing a symbol means absolute value, but if I've ever seen it, I can't recall.

Although 0x7C is overly specific, since if a sibling comment is correct (I have no reason to think otherwise), | for bitwise OR originates in PL/1, where it would have been encoded in EBCDIC, which codes it as 0x4F.

I'm not really disagreeing with you, the |abs| notation is quite a bit older than computers, just musing on what should count as the first use of "|". I'm inclined to say that it should go to the first use of an encoding of "|", not to the similarly-appearing pen and paper notation, and definitely not the first use of ASCII "|" aka 0x7C in a programming language. But I don't think there's a right answer here, it's a matter of taste.

Because one could argue back to the Roman numeral I, if one were determined to do so: when written sans serif, it's just a vertical line, after all. Somehow, abs notation and "first use of an encoded vertical bar" both seem reasonable, while the Roman numeral and specifically-ASCII don't, but I doubt I can unpack that intuition in any detail.

adrian_b 861 days ago

The language APL\360 of IBM (August 1968) and the APL dialects that followed it used a single "|" as a monadic prefix operator that computes the absolute value and also as a dyadic infix operator that computes the remainder of the division (but with the operand order reversed in comparison with the language C, which is usually much more convenient, especially in APL, where this order avoids the need for parentheses in most cases).