Hacker News
by 0xbadcafebee 941 days ago
One of the major failures of the modern computer science age (among others) is a lack of direction away from traditional i/o. We are still stuck on files and directories and TCP sockets. Yet what we actually want to do with i/o is not read a file from a local disk, or connect to a server and transmit the contents of the file over some additional protocol.

What we really want is to store some data somewhere, and later be able to retrieve it, without necessarily knowing what it was we stored or where or how. And we don't want to think about what server it's on, or what hard drive, or what folder. And we don't want to think about client protocols or query languages.

All of that would be possible if we reinvented i/o. Basically, just imagine what you want your experience to be, and then start making up names for functions that do that. Stuff that in a kernel, or a standard library. Now you have i/o that's based on how you really want to use data. The backend implementation of it can vary, but the point is to make the user experience what we actually want rather than what somebody else thinks is practical. Make the data interface you want to use, and make it a standard.

5 comments

Many decades ago, before we had files, we had data stores that did what you describe. They stored records of data. No need for files.

What happened is that we discovered files are really useful because you don't need to declare the format of the data that goes into the file. So the operating system can handle things like reading and writing, and the application can organise how it wants to keep the data in the file.

The same really goes for sockets. It is really useful to have somebody transfer the data for you in a stream while you, the application, only worry about the format of the data.
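
To make the split concrete, here is a minimal sketch of an application imposing its own record format on a plain byte stream. The length-prefix framing and the helper names (`send_record`, `recv_exact`, `recv_record`, `demo`) are assumptions chosen for illustration, not something from the thread; the stream itself knows nothing about records.

```python
import socket
import struct


def send_record(sock, payload):
    """Write one record: a 4-byte big-endian length, then the payload bytes."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes; a stream may deliver them in smaller pieces."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("stream closed mid-record")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_record(sock):
    """Read the length prefix, then the payload it announces."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def demo():
    """Send three records over a connected socket pair and read them back."""
    a, b = socket.socketpair()
    with a, b:
        for payload in (b"hello", b"", b"world!"):
            send_record(a, payload)
        return [recv_record(b) for _ in range(3)]
```

The transport only moves bytes; where one record ends and the next begins is entirely the application's decision.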

A junior engineer on my team asked me why we store bytes in our Blobstore/Filesystem rather than something structured like a DB.

Bytes are a "narrow waist", and in fact DBs actually use our system for storage. By supporting bytes, anything that can be serialized can be stored by the next layer up, and the contract is very simple.
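
A rough sketch of what that contract can look like, assuming a toy in-memory store. `BlobStore`, `JsonDocuments`, and `Counters` are hypothetical names invented for this example and say nothing about how the commenter's actual system is built.

```python
import json
import struct


class BlobStore:
    """Hypothetical byte-only store: keys map to opaque bytes, nothing more."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        if not isinstance(data, (bytes, bytearray)):
            raise TypeError("BlobStore only accepts bytes")
        self._blobs[key] = bytes(data)

    def get(self, key):
        return self._blobs[key]


class JsonDocuments:
    """One layer up: documents serialized as UTF-8 JSON."""

    def __init__(self, store):
        self._store = store

    def save(self, key, doc):
        self._store.put(key, json.dumps(doc).encode("utf-8"))

    def load(self, key):
        return json.loads(self._store.get(key).decode("utf-8"))


class Counters:
    """A different layer on the same store: fixed-width signed 64-bit integers."""

    def __init__(self, store):
        self._store = store

    def set(self, key, value):
        self._store.put(key, struct.pack(">q", value))

    def get(self, key):
        return struct.unpack(">q", self._store.get(key))[0]
```

Both layers share one store without the store knowing either format, which is the property the "narrow waist" description points at.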

We have this right now. It's abstractions on top of the real primitives. That's what client protocols and query languages are.

> And we don't want to think about what server it's on, or what hard drive, or what folder. And we don't want to think about client protocols or query languages.

Different types of data are legal in different jurisdictions (for example, the definition of PII varies), so the physical location of the hard drive matters.

When medical data is stored, where and how is important. When handling data that legally needs an audit trail, abstractions won't do.

When data is needed at low latency, the details matter. When cost is important (egress charges per operation or counted by size of data transferred), details matter.

> the physical location of the hard drive matters

Not exactly: what matters is the legal designation of the data storage device. The location of that device is one of many factors that "matter", but not to the application, or developer, or user. They only "matter" to the law. We aren't going to start writing UnitedStatesFileWrite() functions, now, are we?

Instead of considering the physical location of a hard drive, what we should be doing is querying a data storage object which has the properties we want:

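  # Hypothetical API: describe the storage you need, not where it lives.
  io_construct = DataStorage()
  storage_search = io_construct.DataStorageSearch({
    "contains": [
      # Legal constraint: storage designated under this jurisdiction.
      { "legal": {
          "jurisdiction": {
            "location": [ {
              "country": "US",
              "state": "California"
            } ]
          }
        }
      },
      # Content constraint: storage holding a matching record.
      { "record": [ { "email": "foo@bar.domain" } ]
      }
    ]
  })
  # The context manager comes first; "as" binds the attached object.
  with io_construct.AttachDataStorage(device=storage_search) as io_object:
    io_object.read()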

We should never have to think about what building a hard drive is located in, much less the complexities of dealing with specific data laws. The IO construct should deal with that.
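
For anyone who wants to poke at the idea, here is a minimal in-memory sketch of that kind of property-based lookup. Everything in it is an assumption made up for illustration: the `Device` fields, the `DataStorage.search` signature, and the matching rule. Note that the jurisdiction metadata still has to be recorded on each device by someone before the search can use it.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical storage device tagged with its legal designation."""
    name: str
    country: str
    state: str
    records: list = field(default_factory=list)


class DataStorage:
    """Hypothetical registry that selects devices by properties, not location paths."""

    def __init__(self, devices):
        self._devices = list(devices)

    def search(self, country, state, record):
        """Return devices in the jurisdiction holding a record that matches
        every key/value pair in `record`."""
        return [
            d for d in self._devices
            if d.country == country
            and d.state == state
            and any(
                all(k in r and r[k] == v for k, v in record.items())
                for r in d.records
            )
        ]
```

A caller asks for "California storage containing this email" and gets back whichever devices qualify, without naming a server or a folder.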

I think the details of IO are already abstracted pretty well; it's a topic that's had a lot of effort put into it. The remaining things you have to think about are pretty fundamental and not really technical in nature, like:

1. Price

2. Brand of whoever is providing the storage (matters because it's a proxy for lots of other details)

3. General physical location

Once you've made those decisions, services like S3 abstract the rest. There are tools that let you access these via FUSE (in which case client protocols don't matter).

Wouldn't the data interface still be a stream of bytes?

No reason it has to be. You could have the data interface accept plugins which preprocess data in different formats and expose it as something else, like an object, document, stream of documents, etc.

Those are all ultimately represented as bytes. You're just looking for a different query language.
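
Both sides of this exchange show up in a small sketch of the plugin idea. The registry, the `decoder` decorator, and `read_as` are hypothetical names made up for this example: callers get objects back, yet every plugin still starts from raw bytes.

```python
import csv
import io
import json

# Hypothetical registry mapping a format name to a function that turns bytes into objects.
DECODERS = {}


def decoder(fmt):
    """Register a decoding plugin under the given format name."""
    def register(fn):
        DECODERS[fmt] = fn
        return fn
    return register


@decoder("json")
def decode_json(raw):
    return json.loads(raw.decode("utf-8"))


@decoder("csv")
def decode_csv(raw):
    # Each row becomes a dict keyed by the header line.
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


def read_as(raw, fmt):
    """Expose stored bytes as a higher-level object via the matching plugin."""
    if fmt not in DECODERS:
        raise ValueError("no decoder registered for %r" % fmt)
    return DECODERS[fmt](raw)
```

Calling `read_as(blob, "csv")` hands the caller a list of row dicts, while the storage underneath only ever saw `blob` as bytes.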