Hacker News
by AnthonyMouse 1754 days ago
"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem."

- Claude Shannon, A Mathematical Theory of Communication

Everything is inherently a stream of bytes. You can use programs that interpret those bytes as typed data, or as newline- or null-delimited arrays, all you like.

Thing is, without a built-in (good) way to do e.g. array types, every program will roll its own. That is a lot of duplicated work, and it makes having programs interact a manual process of integrating two unspecified ad-hoc standards.

Programming languages create idioms and standard ways to do things. We judge those for their clarity and quality. On that front, UNIX's "you get a stream of bytes, do what you want" leads to a fragmented world of low-quality idioms, and hence gets judged poorly.

Except that there are standard ways of doing arrays. Newline-terminated is the most common, null-terminated is used for data that could have unusual characters in it, and most of the standard utilities support both.
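To make the difference between the two conventions concrete, here is a minimal sketch. The helper `split_records` and the sample file names are made up for illustration; the point is only that a name with an embedded newline survives NUL termination but not newline termination.

```python
def split_records(stream: bytes, delimiter: bytes) -> list[bytes]:
    """Split a byte stream into records that are each terminated by `delimiter`."""
    records = stream.split(delimiter)
    # A terminated stream ends with the delimiter, which leaves one empty
    # trailing piece after split(); drop it.
    if records and records[-1] == b"":
        records.pop()
    return records


# Two hypothetical file names, one of which contains an embedded newline.
names = [b"report.txt", b"odd\nname.txt"]

newline_stream = b"".join(name + b"\n" for name in names)
nul_stream = b"".join(name + b"\0" for name in names)

# Newline termination yields three records: the odd name is cut in two.
print(split_records(newline_stream, b"\n"))
# NUL termination yields two records, identical to the original names.
print(split_records(nul_stream, b"\0"))
```

This is the same trade-off behind pairs like `find -print0` and `xargs -0`: NUL cannot appear in a file name, so it is a safe terminator.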

The problem with not doing this is that the operating system is imposing some arbitrary standard on data that may not have even come from the same system.

If the data you get is JSON, there are utilities for parsing JSON. If it's XML, there are utilities for parsing XML. It could be a PDF and you want to convert each page to a TIFF and then tar them into an archive. The file came over the internet; it doesn't care about your system's encoding defaults.
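As a small illustration of a format-specific parser not caring about the local system's defaults, the sketch below feeds Python's standard `json` module the same document as raw bytes in two encodings. The sample document is invented; `json.loads` accepting bytes and detecting UTF-8, UTF-16 or UTF-32 is standard library behaviour.

```python
import json

# A made-up document containing non-ASCII text.
document = {"title": "na\u00efve caf\u00e9", "pages": 3}

# The same document as it might arrive from systems with different defaults.
text = json.dumps(document, ensure_ascii=False)
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

# json.loads accepts bytes and works out the encoding from the bytes
# themselves, so the local default encoding never enters the picture.
print(json.loads(as_utf8) == document)
print(json.loads(as_utf16) == document)
```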

Well, in light of the topic of the original article, I'd just compare working with byte streams to working with a Smalltalk-style object, and see which one is more pleasant. (You could still do a byte stream in Smalltalk land; you'd just have to be explicit, and it wouldn't be the default.)

Standard formats like JSON, XML, and PDF are arbitrary encoding defaults, but on a slightly higher level.

If I send your system something obscure like an SDIF audio file you're not going to have the tools to do anything with it unless you can find and install an SDIF library.

What if the OS could do this automatically because a typed schema and standard API was built into the SDIF (or any) standard and installable in the same way code dependencies are?

> If I send your system something obscure like an SDIF audio file you're not going to have the tools to do anything with it unless you can find and install an SDIF library.

Which is the point. So then you find and install an SDIF library. The OS can't reasonably predict and support every possible file format that anyone will ever come up with in the past or future, so the answer is to be flexible and use different utilities for different formats.

> What if the OS could do this automatically because a typed schema and standard API was built into the SDIF (or any) standard and installable in the same way code dependencies are?

The problem with this is that what you can do with a file depends on the format. You can split a PDF into pages; you can't split an SDIF into pages. An SDIF might have channels; a PDF doesn't have channels. You could call each frame in a video a "page" and try to compare them with a PDF, but the video "page" is going to be a pure individual image whereas a PDF page might have text or multiple individual embedded images.

Everything is so specific to the format that you're going to want format-specific utilities to operate on it that take into account the characteristics of the format.

At best you'll sometimes have multiple formats that are largely "the same" interface-wise like PNG and JPEG and then you would want to use the same interfaces to access both, but that's what already happens. You use e.g. ImageMagick when dealing with both PNGs and JPEGs because it supports both. But it doesn't support SDIF because you need a completely different interface when dealing with audio.

And file formats already have magic strings in them that identify what kind of file they are, which is about the extent of the information the OS could provide you about the file without assuming anything about what's inside it. The schema is already implied by the header magic.
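For illustration, identifying a file by its header magic can be sketched in a few lines. The table below holds only a handful of well-known signatures, and the `sniff` function is a made-up helper, not how any particular OS or tool does it; utilities like `file` carry far larger databases.

```python
# A few well-known file signatures (leading "magic" bytes).
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"%PDF-", "PDF document"),
    (b"\x1f\x8b", "gzip data"),
]


def sniff(header: bytes) -> str:
    """Guess a file type from its leading bytes, without parsing the rest."""
    for signature, kind in MAGIC:
        if header.startswith(signature):
            return kind
    return "unknown"


print(sniff(b"%PDF-1.7\n"))
print(sniff(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
print(sniff(b"just some text"))
```

Note that the result is only a label: it says which format-specific utility to reach for, not what operations make sense on the contents.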

There are indeed standard ways, but they are _bad_. They are badly specified, but mostly they are just bad-but-easy ideas. That is not to say that anyone who wrote a bash tool that uses newlines or nulls as sentinels is a bad programmer. It is to say that UNIX IPC offering nothing better than newlines or nulls as its idiom is a big downside of UNIX.

> There are standard ways of doing arrays

Is that why 80% (at a guess) of shell scripts don’t handle spaces in file names correctly?
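The failure mode being described can be shown in miniature. This is a rough model, assuming a made-up listing: `str.split()` stands in for what an unquoted shell expansion does (splitting on any whitespace), while `splitlines()` stands in for treating each line as one item.

```python
# A hypothetical listing of two files, one with a space in its name.
listing = "notes.txt\nholiday photos.zip\n"

# Roughly what an unquoted expansion does: split on any run of whitespace.
naive = listing.split()

# Splitting on line terminators only keeps the name with a space intact.
by_line = listing.splitlines()

print(naive)    # the second name has been broken into two words
print(by_line)  # two items, one per file
```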

Nothing is inherently a stream of bytes. If all you have is a stream of bytes without a context, you have absolutely no idea what it's supposed to be and no clue how to do anything useful with it.
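A tiny example of the same four bytes meaning different things depending on the context you bring to them, using Python's standard `struct` module. The payload is arbitrary and chosen only for illustration.

```python
import struct

# Four bytes with no accompanying context.
payload = b"\x00\x00\x80\x3f"

# Three equally valid readings, depending on what the reader assumes.
as_float = struct.unpack("<f", payload)[0]   # little-endian 32-bit float
as_uint = struct.unpack("<I", payload)[0]    # little-endian unsigned 32-bit int
as_shorts = struct.unpack("<2H", payload)    # two little-endian unsigned 16-bit ints

print(as_float)   # 1.0
print(as_uint)    # 1065353216
print(as_shorts)  # (0, 16256)
```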

The whole point of OP is that it should be easy, or at least possible, to access contextual meta-information. This makes it easier to build tools that work with meaning automatically, instead of relying on unreliable, partial, and/or broken manual context handling based on unstated or implied assumptions.

You can certainly get a decent idea of what information is there by just looking at the bytes and reverse-engineering the protocol. As a matter of fact, this is how unix pipelines are composed.

Now, of course it would be better if each program formally described the schema of its output and the system provided tooling to act on this information, but being able to string together uncooperative programs via their common denominator (i.e. a stream of bytes) is powerful, and it is the unix way.
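One way to picture "a program describing the schema of its output" while still speaking plain text over a pipe is a self-describing header line. Everything here is a hypothetical convention invented for this sketch (the JSON header of name/type pairs, the tab-separated rows, the `write_table` and `read_table` helpers); it is not an existing unix facility.

```python
import io
import json

# Hypothetical type names a producer may declare, mapped to converters.
CASTS = {"int": int, "float": float, "str": str}


def write_table(out, schema, rows):
    """Emit a one-line JSON schema header, then tab-separated rows."""
    out.write(json.dumps(schema) + "\n")
    for row in rows:
        out.write("\t".join(str(value) for value in row) + "\n")


def read_table(inp):
    """Read the schema header and yield each row as a typed dict."""
    schema = json.loads(inp.readline())
    for line in inp:
        fields = line.rstrip("\n").split("\t")
        yield {
            name: CASTS[type_name](value)
            for (name, type_name), value in zip(schema, fields)
        }


# An in-memory stream stands in for a pipe between two programs.
pipe = io.StringIO()
write_table(pipe, [["name", "str"], ["size", "int"]],
            [("a.txt", 120), ("b c.txt", 7)])
pipe.seek(0)
print(list(read_table(pipe)))
```

A consumer that ignores the first line still sees ordinary text, so the lowest common denominator remains available to uncooperative programs. The sketch does not handle values that themselves contain tabs or newlines.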