by twic 3453 days ago
That particular example of find not needing to sort persistently annoys me. sort doesn't know anything about the structure of its input, so it has to read and buffer all of it before it can sort it. find knows that its input is a tree of strings, which it could exploit to produce sorted output at the cost of buffering one directory's worth of filenames at each level of the tree.

It's rarely a significant problem in practice, but it annoys me in principle!
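To make the idea concrete, here is a minimal Python sketch of a traversal that emits paths in order while only ever buffering one directory's entries per level. The function name `sorted_walk` is made up for illustration, and this is not how find is implemented. The order is depth-first with siblings sorted by name, which can differ from piping through sort when a name contains a character that sorts before "/" (for example "-").

```python
import os
from typing import Iterator


def sorted_walk(root: str) -> Iterator[str]:
    """Yield root and everything beneath it, siblings in name order.

    Only one directory's worth of entries is buffered per level of
    the tree, instead of the whole output as `find | sort` requires.
    """
    yield root
    # Buffer and sort just this directory's entries.
    with os.scandir(root) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # Recurse: the subtree is emitted before the next sibling.
            yield from sorted_walk(entry.path)
        else:
            yield entry.path
```

Memory use is bounded by the largest directory on the current path from the root, rather than by the total number of files.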

1 comments

To avoid cluttering "find" with a sorting interface, we could use the modern technique of processing push-down:

If you do "blah | sort", then "sort" could ask its upstream processing node whether it supported sorting on the requisite fields, and "push down" the necessary sort-order descriptor into the "blah" step.

That requires two things: first, that the pipe API sets up a communications channel between the two programs in a way that makes them aware of each other and able to exchange information; and second, that the pipe protocol is based on typed, structured data. I want both things.
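A rough sketch of that negotiation, modelled in-process in Python rather than over a real pipe. Everything here is hypothetical (`Source`, `try_push_sort`, `sort_stage` and the rest are names invented for illustration); no existing shell or pipe API works this way. The sort stage asks its upstream whether it can deliver rows already ordered, and only buffers everything if the answer is no.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Sequence, Tuple

Row = Dict[str, object]  # stand-in for "typed, structured data"


class Source:
    """Hypothetical upstream stage. By default it refuses push-down."""

    def rows(self) -> Iterator[Row]:
        raise NotImplementedError

    def try_push_sort(self, fields: Sequence[str]) -> bool:
        return False


class ListSource(Source):
    """An upstream that knows nothing about ordering."""

    def __init__(self, data: Iterable[Row]) -> None:
        self._data: List[Row] = list(data)

    def rows(self) -> Iterator[Row]:
        return iter(self._data)


class IndexedSource(ListSource):
    """An upstream that accepts a pushed-down sort order.

    It sorts internally here; a real one would use an index or
    rewrite its own query (assumption for the sake of the sketch).
    """

    def __init__(self, data: Iterable[Row]) -> None:
        super().__init__(data)
        self.pushed_order: Optional[Tuple[str, ...]] = None

    def try_push_sort(self, fields: Sequence[str]) -> bool:
        self.pushed_order = tuple(fields)
        return True

    def rows(self) -> Iterator[Row]:
        if self.pushed_order is None:
            return iter(self._data)
        order = self.pushed_order
        return iter(sorted(self._data, key=lambda r: tuple(r[f] for f in order)))


def sort_stage(upstream: Source, fields: Sequence[str]) -> Iterator[Row]:
    """The "sort" end of `blah | sort`."""
    if upstream.try_push_sort(fields):
        # Upstream promised ordered output: stream it straight through.
        yield from upstream.rows()
    else:
        # Fallback: read and buffer everything, as sort does today.
        buffered = list(upstream.rows())
        buffered.sort(key=lambda r: tuple(r[f] for f in fields))
        yield from buffered
```

Either way the consumer sees the same ordered rows; the difference is only in who does the work and how much has to be buffered.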

Imagine if you had that, then you could conceivably also do:

    psql -c "select firstname, lastname from foo" |
      sort -f lastname

and psql would automagically rewrite its query to:

    select firstname, lastname from foo order by lastname

That's the future I want to live in, anyway.
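For what it's worth, the rewriting half is easy to mock up. Below is a small Python sketch that uses the standard-library sqlite3 module in place of psql; the `QuerySource` class and its `try_push_sort` method are hypothetical names, and appending the clause naively like this only works for a simple query with no existing ORDER BY or LIMIT.

```python
import sqlite3
from typing import List, Sequence, Tuple


class QuerySource:
    """Hypothetical upstream stage wrapping a SQL query."""

    def __init__(self, conn: sqlite3.Connection, sql: str,
                 columns: Sequence[str]) -> None:
        self.conn = conn
        self.sql = sql
        self.columns = tuple(columns)

    def try_push_sort(self, field: str) -> bool:
        # Only accept columns the query is known to produce, so the
        # field name is never arbitrary text spliced into the SQL.
        if field not in self.columns:
            return False
        self.sql = f"{self.sql} order by {field}"
        return True

    def rows(self) -> List[Tuple]:
        return self.conn.execute(self.sql).fetchall()


def demo() -> Tuple[str, List[Tuple]]:
    conn = sqlite3.connect(":memory:")
    conn.execute("create table foo (firstname text, lastname text)")
    conn.executemany(
        "insert into foo values (?, ?)",
        [("Alan", "Turing"), ("Grace", "Hopper"), ("Ada", "Lovelace")],
    )
    source = QuerySource(
        conn,
        "select firstname, lastname from foo",
        ["firstname", "lastname"],
    )
    # The downstream "sort" asks for ordering by lastname.
    source.try_push_sort("lastname")
    result = source.sql, source.rows()
    conn.close()
    return result
```

The hard part is not this rewrite but getting the request from the sort process to the psql process in the first place, which is what the pipe API would have to provide.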

The inability to do this sort of thing is really a product of a failure to modernize the 1970s text-oriented pipe data model. I believe PowerShell (which I've never used, only read about) provides a mechanism to accomplish this sort of thing, at the expense of being extremely Microsoft-flavoured.

I don't think there's anything even vaguely scifi about those abilities, but the Unix world is hampered by a curious reluctance to innovate in certain core technologies such as, well, Unix itself. That's why we still have tmux and such.

... or the downstream program could ask the upstream one about metadata/type information for the stream it is being passed (or get it automatically along with the stream), and then it could benefit fully from already-known information. That does not remove the potential need to read the full stream and buffer it before doing the processing, though.

I certainly wouldn't want to implement psql if it needed to handle all that extra communication.