| HN Mirror

by jcgl 260 days ago

> Heresy warning. Maybe the inputs and outputs don’t look anything like CLI or stdio text. Maybe we move on from 1000-different DSLs (each CLI’s unique input parameters and output formats) and make inputs and outputs object shaped. Maybe we make the available set of objects, methods and schemas discoverable in the terminal API.

Entirely agree. Stdio text (which is really just stdio bytes) deeply limits how composable your shell programs can be, since data and its representation are tightly coupled (they're exactly the same). I wrote a smidgin here[0] on my blog, but take a look at this unix vs. PowerShell example I have there. Please look beyond PowerShell's incidental verbosity here and focus more deeply on the profoundly superior composition that you can only have once you get self-describing objects over stdio instead of plain bytes.

  $ # the unix way
  $ find . -name '*.go' -not -name '*_test.go' -ctime -4 -exec cat {} \; | wc -l
  7119
  $ # the powershell way
  $ pwsh -c 'gci -recurse | where {($_.name -like "*.go") -and ($_.name -notlike "*_test.go") -and ($_.LastWriteTime -gt (get-date).AddDays(-4))} | gc | measure | select -ExpandProperty count'
  7119

[0] https://www.cgl.sh/blog/posts/sh.html

2 comments

rbanffy 260 days ago

I have a distaste for the verboseness of PowerShell, but I also have concerns with the attempt to bake in complex objects into the pipeline. When you do that, programs up and down the stream need to be aware of that - and that makes it brittle.

One key aspect of the Unix way is that the stream is of bytes (often interpreted as characters) with little to no hint as to what's inside it. This way, tools like `grep` and `awk` can be generic and work on anything while others such as `jq` can specialize and work only on a specific data format, and can do more sophisticated manipulation because of that.

jcgl 259 days ago

> I have a distaste for the verboseness of PowerShell, but I also have concerns with the attempt to bake in complex objects into the pipeline. When you do that, programs up and down the stream need to be aware of that - and that makes it brittle.

Yeah, you definitely can't write tools for the unix shell that assume some kind of self-describing message encoding. I mean, you could, but you'd have to do a lot of work to wrap it so that it can work with unix byte streams at the edges. I believe oil shell and nushell have prior art on this. To your point, it should be telling that those are shells of their own, rather than tools for existing unix shells.

> One key aspect of the Unix way is that the stream is of bytes (often interpreted as characters) with little to no hint as to what's inside it. This way, tools like `grep` and `awk` can be generic and work on anything while others such as `jq` can specialize and work only on a specific data format, and can do more sophisticated manipulation because of that.

This seems backwards to me. grep and awk are extremely fragile because they have to look at what's inside. They have to read every byte, and the user of grep and awk must understand entirely what the incoming data will be.

Whereas with PowerShell or any other system with self-describing messages, the user makes some lightweight assertions about the abstract shape of the data--not the concrete shape of that data's representation.
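
A minimal sketch of what "asserting the abstract shape" can look like, written in Python rather than PowerShell. `FileInfo`, its two fields, and `recent_go_sources` are hypothetical names made up for this illustration; they loosely mirror the `name` and `LastWriteTime` properties used in the PowerShell one-liner above, and are not any shell's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileInfo:
    """Hypothetical stand-in for the file objects a structured shell emits."""
    name: str
    last_write_time: datetime


def recent_go_sources(files, now, days=4):
    """Keep non-test Go files modified within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [
        f for f in files
        if f.name.endswith(".go")
        and not f.name.endswith("_test.go")
        and f.last_write_time > cutoff
    ]
```

The filter only names the fields it needs. It never parses a date string or depends on how a listing happens to be laid out as text.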

fainpul 260 days ago

> When you do that, programs up and down the stream need to be aware of that - and that makes it brittle.

You can go from object processing to good old text processing with *nix tools, no problem.

Instead of using 100% PowerShell to count all lines in the text files:

  gc *.txt | measure | select -ExpandProperty count

you can switch to `wc` if you like:

  gc *.txt | wc -l

`gc` is Get-Content – basically cat. You can also use awk, sed, jq etc.

forgotpwd16 259 days ago

What GP talks about is illustrated with the following modification of your example:

  PS /home/user> gc *.txt | measure | cat | select -ExpandProperty Count
  Select-Object: Property "Count" cannot be found.
  Select-Object: Property "Count" cannot be found.
  Select-Object: Property "Count" cannot be found.
  Select-Object: Property "Count" cannot be found.
  Select-Object: Property "Count" cannot be found.
  Select-Object: Property "Count" cannot be found.
  Select-Object: Property "Count" cannot be found.
  Select-Object: Property "Count" cannot be found.
  Select-Object: Property "Count" cannot be found.

Essentially at every step you need to consider whether the preceding command outputs objects or not. This isn't the case with the Unix way. The pipeline always carries a stream of bytes. You only have to consider how to interpret that stream.

jcgl 259 days ago

> Essentially at every step you need to consider whether the preceding command outputs objects or not.

This is true, no matter what language, paradigm, or even universe you're in; data that gets passed into a pipeline needs to have the abstract shape that the pipeline expects. This is always true, and it's every bit as true of unix byte streams.

You can of course have pipelines that try to coerce data or assert its structure. The PowerShell example you showed does the latter, and raises an error message that the assertion failed.

Unix byte streams do neither. There's no coercion, no assertion. Just blind trust. When you have IFS set incorrectly, you simply get a wrong answer. When you grab the wrong field number with cut or awk, you get a wrong (or empty) answer. The input data matters every bit as much with unix as it does with every other system of computation. What changes are characteristics like brittleness and enforceability of invariants.
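
A rough sketch of that contrast in Python. `field_by_position` is a hypothetical helper that imitates awk's `$N` behaviour (ignoring `$0`), and `field_by_name` imitates a lookup on a self-describing record; the sample line and record are invented for illustration.

```python
def field_by_position(line, number):
    """Mimic awk's `$N`: 1-based whitespace field, empty string if absent."""
    fields = line.split()
    return fields[number - 1] if 1 <= number <= len(fields) else ""


def field_by_name(record, name):
    """Look a field up by name; a missing name raises KeyError."""
    return record[name]


line = "alice 4096 notes.txt"
record = {"owner": "alice", "size": 4096, "name": "notes.txt"}

size = field_by_position(line, 2)       # the intended field
wrong = field_by_position(line, 1)      # off by one: still "succeeds"
missing = field_by_position(line, 7)    # no such field: empty, no error

try:
    field_by_name(record, "Count")
    outcome = "ok"
except KeyError:
    outcome = "error"                   # the failure is reported, not hidden
```

Both positional mistakes return a value that flows on down the pipeline. Only the lookup by name stops and reports the problem.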

fainpul 259 days ago

jcgl already answered this well. My example was merely to show that you can go from PS objects to old-fashioned text processing without issues. Any non-PS tool will just receive the text representation of the objects – exactly as you see them in the terminal. I wasn't trying to imply you can go back and forth between them like magic (which should be obvious).

Once you have lost the objects and work with simple text, you have to use the text processing tools of PS, if that's what you want to do. To continue your example:

  gc *.txt | measure | cat | sls count

sls (Select-String) is like grep.

(Note: this example is nonsense, just to show that it works)

mongol 260 days ago

Another question is if such pipelines should act on objects, or more structured text streams. Many programs can output json, is this a pragmatic way forward to extend and improve further?

jcgl 259 days ago

Replying out-of-order to your points.

> Many programs can output json, is this a pragmatic way forward to extend and improve further?

Yes, json is a pragmatic way to put a band-aid on this issue. It really is Good Enough™ for many things, and shell tools that optionally emit json are strictly an improvement, imo.

> Another question is if such pipelines should act on objects, or more structured text streams.

"Structured text" is inherently brittle, compared to a self-describing message format. For one, even when you structure your text, you will always end up dealing with escaping problems.
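
The escaping problem can be sketched in a few lines of Python. The helper names and the tab-containing file name are invented for this illustration, and JSON here merely stands in for any encoding that handles escaping itself.

```python
import json


def to_tsv(name, size):
    """Naive tab-separated encoding with no escaping."""
    return f"{name}\t{size}"


def size_from_tsv(line):
    """Read the second tab-separated field, as `cut -f2` would."""
    return line.split("\t")[1]


def to_json(name, size):
    """Encode the same record as one JSON object."""
    return json.dumps({"name": name, "size": size})


def size_from_json(line):
    """Decode the record and read the size by name."""
    return json.loads(line)["size"]


awkward_name = "my\tnotes.txt"   # a file name containing a tab
```

With an ordinary name both encodings round-trip. With `awkward_name`, `size_from_tsv(to_tsv(awkward_name, 10))` hands back part of the file name instead of the size, while the JSON pair still returns `10`.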

Even worse, imo, is that backwards compatibility becomes a terrible challenge. Let me illustrate with a best-case scenario for structured text: ASCII-only tabular data (nested, binary, unicode data is left as an exercise to the reader, in true unix fashion). The most natural thing to do in unix is to select the relevant fields with awk. More specifically, you select fields by their number. Well, now your pipeline is tightly coupled to column ordering.

In principle you could use awk to select based on field names, but those field names are just text, so it becomes incumbent on you, the pipeline author, to make sure they don't get lost in transit. No easy tail or grep for you!

All of this becomes avoidable when you deal with self-describing messages instead of raw bytes.
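
A small Python sketch of the points above, using invented sample data. `second_column` plays the role of selecting a field by number, `column_named` selects by header name, and `keep_matching` is a hypothetical stand-in for a grep-style filter. JSON lines are used as one example of records that carry their own field names; the thread does not prescribe that format.

```python
import csv
import io
import json

V1 = "name,size\nmain.go,120\nutil.go,80\n"
V2 = "size,name\n120,main.go\n80,util.go\n"   # same data, columns reordered
JSONL = '{"name": "main.go", "size": 120}\n{"name": "util.go", "size": 80}\n'


def second_column(text):
    """Select by position: skip the header row, take field 2 of each row."""
    rows = list(csv.reader(io.StringIO(text)))[1:]
    return [row[1] for row in rows]


def column_named(text, field):
    """Select by name: relies on the header row still being the first line."""
    return [row[field] for row in csv.DictReader(io.StringIO(text))]


def keep_matching(text, needle):
    """grep-like filter: keep only the lines that contain `needle`."""
    return "".join(
        line for line in text.splitlines(keepends=True) if needle in line
    )


def jsonl_field(text, field):
    """Select by name from JSON lines; every line names its own fields."""
    return [json.loads(line)[field] for line in text.splitlines() if line]
```

`second_column` returns sizes for `V1` but file names for `V2`. `column_named` survives the reordering, yet after `keep_matching(V1, "main")` the header is gone and it finds no rows at all. `jsonl_field(keep_matching(JSONL, "main"), "size")` still returns `[120]`, because the filter cannot strip the field names from the record it keeps.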
