| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jrockway 2103 days ago

I think relational databases have largely failed developers because they don't provide the features they actually need.

A common question that comes up is how to do zero-downtime schema changes. The answer is that there isn't one. A correct implementation would store each schema version in the database and when an application connects, it would specify which version it's speaking. The developer would supply a mapping on how to make vX data available to a vY program. But no relational database supports such a feature, so people are forced to tread carefully -- look at all the deployment software that exists to attempt to find changes with database migrations and treat them differently. Look at all the software people have written to even apply those migrations. It's staggering, all because in the 70s when these systems were designed, the thought of deploying your code multiple times a day was unheard of.

Another problem that comes up is transactional isolation. Most engineers, and even casual practitioners, "know" that transactions exist for cases where you want to perform multiple operations atomically. But very few of these people are running the transaction with an isolation level that provides those guarantees. They will write their program assuming that transactions are strictly serializable, but in fact they are using "read committed" or some other weak form of isolation that totally breaks their assumptions. Then the database gets into a weird state, and people are baffled as to how that could happen. The actual implementation is so different from the CS assumptions that it's not even something that crosses people's minds, and "read committed" behaving as "read committed" looks like a heisenbug in the rare case they actually notice what's going on. That's super bad.

These underlying problems have nothing to do with SQL or NoSQL, though. NoSQL can smooth over schema incompatibilities a bit (perhaps the schema specifies everything as a "field tag" instead of a name, so you can safely rename columns, or perhaps everything is "optional", so the application can detect that it's reading an old record when it's missing; you can also easily build your own versioning system on top because the data doesn't mean anything to the database engine itself), but you can still get yourself into a lot of trouble. The NoSQL databases also have a horrifying transactional cleanliness record. Postgres may be "read committed" by default but at least you can get real transactions if you ask for them; good luck ever getting them with some NoSQL databases.

I guess where I'm going with this is that database engines are focused on the wrong problems. The relational model is very good. But it's something you can bring yourself once you have a way to transactionally read and write keys with opaque values. You can also add indexing at the application layer, or triggers, or encryption, or auditing, or RBAC... whatever, it's just code. At the end of the day, picking a relational database just gets you a VERY opinionated set of defaults that is unlikely to be what your application needs and nearly impossible to change later... but the defaults are juuuuust good enough that nobody makes a real effort to change them. Meanwhile, we ignore the problems that actually plague developers; schema versioning, unusual data types (time series, large blobs), availability, replication, etc. The operational concerns have been ignored for decades, and it's slowing everyone down.

People are right to be looking for alternatives, even though we know that most of the alternatives have even worse problems. Someday, somewhere, someone will get it right.

1 comments

ludston 2103 days ago

Good post. I think it is also worth mentioning that the relational model has a few intrinsic performance issues and gotcha's that require some uncomfortable and creative wrangling.

Specifically, safely handling concurrent writes to different rows that are related to each other requires careful consideration, and when you have a join with clauses on both tables it is not possible to have perfect efficiency without multi-table indexes.

I think that these are not intuitive, but solvable. The troubles arise when an application doesn't consider these issues until after it hits scale (which there is sufficient concurrency and enough millions of rows for these problems to rear their heads).

link