Hacker News new | ask | show | jobs
by 0xbadcafebee 1977 days ago
Am I the only one who thinks it's bizarre that a structured query language defines so much of how we choose to architect and operate our systems?

Think about it for a sec: SQL is literally just a language to query and manipulate data. There's no reason that schema changes and data changes have to happen only through the one language, and only through one interface on one piece of software.

For whatever reason, this has just been how the most popular products have done it, and they largely just never changed their designs in 40 years. I like the language, and the general organization of the data is handy. But everything else about it is archaic.

Why fumble around with synchronization? 99% of the data in big datasets doesn't change. This doesn't even have to be "log-based", we just need to be able to ship the old, stable data and treat it almost like "cold storage".

Why is there a single point of entry into the data? You have to use the one database cluster to access the one database and the one set of tables. Why can't we expose that same data in multiple ways, using multiple pieces of software, on multiple endpoints?

Other protocols and languages have ways of dealing with these kinds of things. LDAP can refer you to a different endpoint to process what you need. Web servers can store, process, and retrieve the same content across many different endpoints in a variety of ways. Lots of technology exists that can easily replicate, snapshot, version-control, etc arbitrary pieces of data and expose them to any application using standard interfaces.

Why haven't we created a database yet which works more like the Unix operating system?

10 comments

Are you kidding, LDAP referrals as a model of how to do it? I mean, I did a lot of LDAP work back in the day, and that's not a feature that saw a lot of action outside academia. Just write your own thing on top of HTTP, that's got referrals too!

There are practically many ways of talking to database systems, if it isn't too troubling that some SQL is often happening somewhere. Like, there's Hasura, postgrest, etc.; or Mongo has a variety of drivers that support different inputs.

One might consider the most unix'y database to be Berkeley DB/Sleepycat, but that is probably not what you wanted. ;)

> Why haven't we created a database yet which works more like the Unix operating system?

Not to be overly snarky, but have you tried? Database design is full of trade-offs.

When I was a kid and learning to program, I wrote some shitty databases for fun. I learned about the trade-offs and that it was easy to write a database that out-performed RDBMSes in specific criteria. But I hadn't thought of making them extensible.

I have a pet project I'm working on, which is a generi distributed system where each component is a microservice. It turns out there's lots of these things built already, mainly by systems engineers for obscure things (Airflow, Rundeck, Stackstorm being some examples). I'll probably think about how I can redesign my project with composeable databases in mind. I don't expect I'll ever have a working product, but it'll be useful to think about this problem.

You're basically telling them to put in months of work to find out. Even if it's not too snarky, it's a ridiculous way to learn something that could be conveyed pretty well in a blog post or a chapter of a book.
I think that's a fair observation, but in this case, a dating app, a relational database and SQL seem like a great fit. The ends users are fairly literally SELECTING and JOINING with LIMITS and newly FOREIGN relationships and so forth :)
Why are we still using ASCII or Unicode character interfaces in shells? Because like SQL they work and are moderately well understood.

There are many query languages and having one common one as a base is useful to transfer skills. Think of it as an on ramp to more specific dialects or technologies.

You just described distributed databases -- which are overwhelmingly now deciding to use SQL as their interface of choice. You're completely hand waving over the fact that data is just a bunch of bits on disk grouped into pages. Everything above that fact is a tradeoff.
Actually not really. A Unix operating system can do everything I described with regular-old data, and it's not a distributed operating system. It simply has extensible standard interfaces.

Do you need a distributed database to read a .txt file with cat, Emacs, and Firefox? No. Why not? Because there's an I/O standard they all use. Does that .txt file have to live on a single filesystem, or disk? No. Why not? Because the storage mediums all have a standard virtual filesystem interface.

There is no reason databases cannot do exactly the same thing. It's just that nobody has made them do it yet. They've stuck with the exact same paradigm, and that then drives how all of us build our systems, with this archaic 40 year old model that requires heavy-lifting maintenance windows and a lot of praying.

You're interweaving several different issues here.

> Why fumble around with synchronization? 99% of the data in big datasets doesn't change. This doesn't even have to be "log-based", we just need to be able to ship the old, stable data and treat it almost like "cold storage".

This is not a feature of SQL, this is a feature of the database. Also, this sounds exactly like doing full-table replication to get the "old" data and then turning on log-based replication. You can do key-based replication if you really want to avoid log-based, but it's generally just a less efficient version of log-based replication.

> Why is there a single point of entry into the data? You have to use the one database cluster to access the one database and the one set of tables. Why can't we expose that same data in multiple ways, using multiple pieces of software, on multiple endpoints?

You can. Postgres supports both Perl and Python extensions that run in the RDBMS process, iirc. Very few people use them because running in the RDBMS process means that you can break the RDBMS process in really bad ways, and it is very difficult to gain any benefits over just running a separate process that communicates over SQL.

So if you consider other processes that communicate with the database and then show views of that over other protocols, that describes most of the backend apps in the world.

There's also stuff like Presto[1] that allows you to run queries distributed over multiple databases, multiple types of databases, etc, etc, etc. In that case, conceptually, Presto is "the database" and all the records you refer to are remote.

1: https://prestodb.io/

> This is not a feature of SQL, this is a feature of the database

Yet they always seem tied together eh? Somehow the conventions are stuck together, and that then affects how our systems work.

> Postgres supports both Perl and Python extensions that run in the RDBMS process

But I'm talking about not having to use the RDBMS process. If I have a text file on the disk, I can use a million different programs to access it. I don't have to run one program through another program just to open, read, write, and close the file with any program. Why don't we design our databases to work this way?

> Very few people use them because running in the RDBMS process means that you can break the RDBMS process in really bad ways

Yes, it does sound bad. That's why I'd prefer an indirect method rather than having to wedge access through the RDBMS

> So if you consider other processes that communicate with the database and then show views of that over other protocols, that describes most of the backend apps in the world.

Yep! We architect entire systems-of-systems just because the model for our data management in an RDBMS is too rigid. We're building spaceships to get to the grocery store because we haven't yet figured out roads and cars.

You're not the only one. There are lots of better alternatives to SQL databases for most use cases (I'm lucky enough to have worked in some places where SQL datastores were the exception rather than the rule). But it takes a long time for cultural change to happen.
Would you mind mentioning some good options? I've always been interested in databases, but find it hard to know which ones to learn more about and when they'd actually be worth investing in (especially since it's hard to build knowledge from toy projects).
Honestly all the popular datastores are fine, though they all have their own foibles. Redis is fine if you just want a basic key-value store that's not HA (and not particularly durable). Cassandra is fine if you want HA (Riak had a better design, but is so much less popular that I probably wouldn't use it these days). Kafka is really good but a much bigger conceptual leap. Even MongoDB is pretty much fine - there are plenty of horror stories about data loss in earlier versions, but it pretty much works most of the time, which is the best you can ever really say about any datastore IME (even PostgreSQL).
Putting data into cold storage, spinning up multiple flexible access points with different datasets... Sounds like what Snowflake is doing right? I don‘t really use it but looks neat from the outside. May be nice to bring some of that to OLTP.
Life is about tradeoffs. Complexity, latency, cost and so on. Things in general are much harder to implement correctly (see Jepsen tests) than to talk about in broad terms.
Not sure why you are downvoted, you made a lot of very valid points and I agree.

People get very comfortable very quickly, even tech savvy folks. Having to learn another language will scare many away, even though the effort might be the same - it's perceived harder.