|
But there is already a de facto data serialization in Unix: tabs/spaces as field separators and newlines as record separators. That's what allows ls and wc to interoperate, but it also means that ls and cut don't interoperate without a transformation in between, because ls uses a columnar layout rather than a separator-based layout. Or take the output of 'uniq -c': you can then 'sort -g' to order lines by the number of occurrences, but if you want to take the top 5 lines and discard the counts with 'cut', you need whitespace transformations in between. (AWK would be an alternative to cut that performs the whitespace conversions on its own, but AWK is a full-blown programming language, so one may as well have something that deals with dictionaries, arrays, etc., as input types anyway.) All this is to say that the untyped bytestream relies on conventions in Unix to make the bytestream usable between composable programs. These conventions are adequate for current uses, but show some weaknesses, and suggest that perhaps there are additional conventions, that if sufficiently simple, could be used to build composable programs that don't need to understand a format that is particular to just one program. I totally agree that objects (meaning data + associated data-specific code) are probably overkill, though optional object interfaces may be nice if the programmer is willing to pay the computational cost, like one does when using AWK over cut. And I definitely think that inheritance is an idea best avoided for data. |
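To make the 'uniq -c' case from the quoted comment concrete, here is a minimal sketch. The sample word list and the `top_words` name are made up for illustration, and it sorts with the POSIX `sort -rn` (descending) rather than the `sort -g` mentioned above. The point is the `sed` step: `uniq -c` pads the count with leading spaces, so `cut -d' '` cannot find the fields until that padding is stripped.

```shell
# Illustrative sample data: one word per line (not from the thread).
words='apple
pear
apple
fig
pear
apple'

# Print the (up to) five most frequent words, without their counts.
# uniq -c right-aligns the count with leading spaces, so the sed step
# strips that padding before cut splits on a single space.
top_words() {
    printf '%s\n' "$words" |
        sort |
        uniq -c |
        sort -rn |
        head -n 5 |
        sed -e 's/^ *//' |
        cut -d' ' -f2-
}

top_words
```

With this input it prints apple, pear and fig, one per line. Drop the `sed` stage and `cut` returns the padding and counts instead of the words, which is the interoperability gap the quoted comment describes.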
De facto is a far cry from "required by the OS". You'll notice that the programs we both mentioned (and other programs that interpret whitespace this way) tend to operate on human-readable text, which uses whitespace to separate words. However, there are many other programs that do not operate on human-readable text, and do not rely on whitespace to delimit records in a pipe (or other file-like construct). Thus, the OS should not try to enforce a One True Record Delimiter, since there isn't one.
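As one illustration of a pipe that does not use whitespace as its record delimiter, here is a small sketch using NUL-terminated records (the same idea as the common `find -print0` / `xargs -0` pairing). The function names and sample records are hypothetical. The second record contains both a space and a newline, so counting lines or words would misjudge it, while counting NUL bytes gives the right answer.

```shell
# Emit three NUL-terminated records; the second one contains a space
# and an embedded newline. (Sample data is illustrative only.)
emit_records() {
    printf 'alpha\0two words\nsecond line\0gamma\0'
}

# Count records by counting NUL bytes: delete every byte that is not
# NUL, then count what is left. tr -d ' ' strips wc's padding on
# systems that right-align the number.
count_records() {
    emit_records | tr -cd '\0' | wc -c | tr -d ' '
}

count_records
```

This prints 3, whereas `emit_records | wc -l` reports 1, because only one newline appears in the stream and it sits inside a record.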
> ...and suggest that perhaps there are additional conventions, that if sufficiently simple, could be used to build composable programs that don't need to understand a format that is particular to just one program.
This really is the heart of the matter. There is a trade-off between the specificity of the conventions and the freedom of the program to interpret the data however it wants. Put strict conventions at the left end of that spectrum and total freedom at the right, and UNIX sits at (almost) the extreme right: the only convention it imposes is that information must be represented as 8-bit bytes.
My question to those who feel that UNIX is too far to the right on this spectrum is, what are some conventions that can be adopted universally that won't break composability? I'm not convinced that there are any. Even simple things like requiring programs to communicate via a set of untyped key/value pairs (where each key and value is a string of bytes) would be risky, since it could easily lead to the creation of disjoint sets of programs which only work with members of their own sets (e.g. members of each set would require set-specific key/value pairs).
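A rough sketch of that risk, under loud assumptions: both producers and the consumer below are hypothetical, and the `key=value` text encoding separated by spaces is just one possible way to write key/value pairs. Both producers obey the convention, yet the consumer only works with the one that happens to share its key names.

```shell
# Two hypothetical producers that both follow the "key=value pairs"
# convention but choose different key names for the same information.
producer_a() { printf 'name=notes.txt size=120\n'; }
producer_b() { printf 'file=notes.txt bytes=120\n'; }

# A hypothetical consumer written against producer_a's keys: it prints
# the value of "size" and exits non-zero if any record lacks that key.
print_size() {
    awk '{
        found = 0
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")
            if (eq > 0 && substr($i, 1, eq - 1) == "size") {
                print substr($i, eq + 1)
                found = 1
            }
        }
        if (!found) { missing = 1 }
    }
    END { exit missing }'
}

producer_a | print_size
producer_b | print_size || echo "no size key: incompatible"
```

The first pipeline prints 120; the second prints only the fallback message. The data is well-formed by the shared convention in both cases, so the convention alone did not buy composability.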