by frankerz 4114 days ago
How does the >() process substitution differ from simply piping the output with | ?

For example (from Wikipedia)

tee >(wc -l >&2) < bigfile | gzip > bigfile.gz

vs

tee < bigfile | wc -l | gzip > bigfile.gz

4 comments

Say that you have a program that splits its output into two files, each given by command line arguments. A normal run would be

    <input.txt munge-data-and-split -o1 out1.txt -o2 out2.txt
but since the output is huge and your disk is old and dying, you want to run xz on it before saving it to disk, so you use >():

    <input.txt munge-data-and-split -o1 >(xz - > out1.txt) -o2 >(xz - > out2.txt)
If you want to do several things in there, I recommend defining a function for clarity:

    pp () { sort -k2,3 -t$'\t' | xz - ; }
    <input.txt munge-data-and-split -o1 >(pp > out1.txt) -o2 >(pp > out2.txt)
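
For anyone who wants to try this without the real program, here is a minimal sketch, assuming bash. split_odd_even is a hypothetical stand-in for munge-data-and-split, and gzip is used instead of xz only because it is more likely to be installed. The substituted processes run asynchronously, so the sketch adds a small trick to wait for them before reading the files back.

```shell
tmp=$(mktemp -d)

# Hypothetical stand-in for munge-data-and-split: writes odd-numbered input
# lines to the file named by $1 and even-numbered lines to the file named by $2.
split_odd_even() {
    awk -v odd="$1" -v even="$2" 'NR % 2 { print > odd; next } { print > even }'
}

# Each >( ) expands to a path name that awk opens like a regular file; whatever
# is written there is compressed by gzip before it reaches the disk.
# The trailing "2>&1 | cat" is only there to wait for both gzip processes:
# they inherit stderr, so cat sees end-of-file once they have exited.
{ seq 1 10 | split_odd_even >(gzip > "$tmp/odd.gz") >(gzip > "$tmp/even.gz"); } 2>&1 | cat

# Read the compressed files back to check what ended up in each.
odd=$(gzip -dc "$tmp/odd.gz" | tr '\n' ' ')
even=$(gzip -dc "$tmp/even.gz" | tr '\n' ' ')
echo "odd: $odd"
echo "even: $even"
rm -r "$tmp"
```

The stand-in never knows it is writing into a compressor; it just sees two file names.
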
In the tee case the substitution is actually going somewhere different than standard out (that's what tee does).

so:

    cmd1 | tee out.txt | cmd2
So tee is splitting the stream into two outputs, one that carries on to stdout (into cmd2) and the other one that is redirected into out.txt.

With process substitution you can do extra stuff on the way out, I guess (I've never seen it used for output before).

It looks like in the example given they're writing wc stuff to stderr while zipping the content (over stdout).
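
A small way to check that reading, assuming bash and a 1000-line stand-in for bigfile. The outer command substitution is my addition; it only captures what wc writes to stderr so it can be inspected afterwards.

```shell
tmp=$(mktemp -d)
seq 1 1000 > "$tmp/bigfile"   # small stand-in for bigfile

# tee sends one copy to wc -l (which reports on stderr) and one copy on
# stdout to gzip. The outer $( ... 2>&1 ) captures that stderr report.
count=$( { tee >(wc -l >&2) < "$tmp/bigfile" | gzip > "$tmp/bigfile.gz"; } 2>&1 )

# The compressed file should still hold every line of the input.
lines_in_gz=$(gzip -dc "$tmp/bigfile.gz" | wc -l)
echo "wc reported $count lines, archive holds $lines_in_gz lines"
rm -r "$tmp"
```

The line count and the compressed file come out of a single pass over the input.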

Nice to see that example, I hadn't even thought about the usefulness of process substitution for outputting like this!

When you connect two processes in a pipe such as ...

    a | b
you connect stdout (fd #1) of a to stdin (fd #0) of b. Technically, the shell process creates a pipe, which is a pair of file descriptors connected back to back. It then forks twice (creating two copies of itself), replacing standard output (file descriptor 1) of the first copy with the write end of the pipe and standard input (file descriptor 0) of the second copy with the read end. Then the first copy replaces itself (exec) with a, and the second copy replaces itself (exec) with b. Everything that a writes to stdout appears on stdin of b.
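
On Linux you can see the result of that plumbing from inside the pipeline. This is only a sketch and assumes /proc is mounted: each side asks what its own descriptor points to.

```shell
# Linux-only: /proc/self/fd/N is a symlink naming what descriptor N points to.
# Left side of a pipeline: stdout (fd 1) of readlink is the pipe to cat.
left=$(readlink /proc/self/fd/1 | cat)
# Right side of a pipeline: stdin (fd 0) of readlink is the pipe from true.
right=$(true | readlink /proc/self/fd/0)
echo "$left $right"
```

Both values look like pipe:[NNNN], where NNNN is the inode number of the pipe and differs from run to run.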

But nothing prevents the shell from connecting other file descriptors to pipes as well. When you create a subprocess by writing ">(c)" on your command line, it's just one additional fork for the shell, and one additional file descriptor pair to create. One end, as in the simple case, replaces stdin (fd #0) of "c"... and because the other end of this pipe has no predefined place among "a"'s descriptors (stdout is already taken by "|b"), the shell somehow has to tell "a" which file descriptor the pipe uses. Under Linux one can refer to open file descriptors as "/dev/fd/<FDNUM>" (/dev/fd is a symlink to /proc/self/fd, and /proc/self is itself a symlink to /proc/<PID>), so that's what is substituted as a "name" to refer to the substituted process on "a"'s command line:

Try this:

    $ echo $$
    12345  # <--- PID of your shell
    $ tee >( sort ) >( sort ) >( sort ) otherfile | sort
and in a second terminal

    $ pstree -ap 12345 # <--- PID of your shell

    zsh,12345
      ├─sort,3600 # <-- this one reads from the other end of the shell's fd #14
      ├─sort,3601 # <-- this one reads from the other end of the shell's fd #15
      ├─sort,3602 # <-- this one reads from the other end of the shell's fd #16
      ├─sort,3604 # <-- this one reads from stdout of tee
      └─tee,3603 /proc/self/fd/14 /proc/self/fd/15 /proc/self/fd/16 otherfile
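
The name substitution can also be seen without a second terminal. This is a minimal sketch, assuming bash or zsh: echo just prints its arguments, so it shows what the command actually receives.

```shell
# Both forms are replaced by a path name before echo runs.
names=$(echo <(true) >(true))
echo "$names"
```

On bash under Linux this typically prints two paths under /dev/fd, while zsh uses /proc/self/fd as in the pstree output above. The exact numbers vary.
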
If your system doesn't support the convenient /proc/self/fd/<NUM> shortcut, the shell might decide not to create a pipe, but rather create temporary FIFOs (named pipes) in /tmp and use those to connect the file descriptors.
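
That fallback can be imitated by hand. The sketch below is not what any particular shell does internally, just the same idea spelled out with mkfifo; the temporary directory and file names are made up for the illustration.

```shell
tmp=$(mktemp -d)
mkfifo "$tmp/fifo"

# Consumer: reads from the FIFO in the background, like the process inside >( ).
wc -l < "$tmp/fifo" > "$tmp/count.txt" &
consumer=$!

# Producer: is handed the FIFO's path as an ordinary file name argument.
printf 'a\nb\nc\n' | tee "$tmp/fifo" > /dev/null

# Unlike with >( ), we can wait for the consumer explicitly.
wait "$consumer"
count=$(cat "$tmp/count.txt")
echo "consumer counted $count lines"
rm -r "$tmp"
```

This is roughly `tee >(wc -l > count.txt)` with the pipe given a name in the file system instead of a descriptor number.
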

http://man7.org/linux/man-pages/man2/pipe.2.html

http://linux.die.net/man/2/dup

You can watch the syscalls as they are made:

    $ strace -fe fork,pipe,close,dup,dup2,execve bash -c 'tee >(sort) >(sort)'
It allows multiple, parallel pipes to each individual command, where the | allows just one.
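
A short illustration of that point on the input side, assuming bash. comm needs two sorted inputs at once, and a plain | could supply only one of them on stdin.

```shell
# Each <( ) becomes its own pipe, passed to comm as a file name.
# comm -12 prints only the lines that appear in both inputs.
common=$(comm -12 <(printf 'b\na\nc\n' | sort) <(printf 'c\nd\nb\n' | sort))
echo "$common"
```

Here the two sorts run in parallel and comm reads from both pipes, printing the shared lines b and c.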