by makmanalp 659 days ago
Hi friend. I, like you, am exactly that person who reads those docs very carefully and has to know about the seemingly innocuous one-liners as well as the one-liners that are not there but should be, that eventually make it in after the next point release. Or the bug report that's marked "S3 (non-critical)" that sneakily waits for days in production until it whacks something very much critical. Or the bug that celebrated its 14th birthday. What I understand about how some of these implementations work makes me very sad sometimes. And I'm telling you that the state of things is silly.

The thing is - and I imagine you know this - that when you run tens of thousands of database instances it's not a matter of if, it's a matter of when. I can't chase after thousands of engineers to make sure they're not hitting some obscure pitfall nor can I demand that they know the ins and outs of the 8 different types of locks in FooDB. And "read the docs lol" definitely doesn't scale. So we build automated mass-guardrails and linters as best we can and use various shenanigans to deal with the real world.
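
To make "linters" concrete, here's a minimal sketch of the kind of check I mean, not anything we actually run. The rule names, the messages and the `lint_migration` helper are all made up for illustration, the index rule assumes PostgreSQL, and splitting on `;` is a shortcut that a real tool would replace with a proper SQL parser.

```python
import re

# Illustrative rules only: (rule name, pattern that flags a statement, message).
RULES = [
    (
        "index-not-concurrent",
        # Flags CREATE [UNIQUE] INDEX that is not followed by CONCURRENTLY.
        re.compile(r"^\s*CREATE\s+(UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY\b)", re.I),
        "CREATE INDEX without CONCURRENTLY blocks writes to the table (PostgreSQL)",
    ),
    (
        "destructive-drop",
        re.compile(r"\bDROP\s+(TABLE|COLUMN)\b", re.I),
        "destructive DROP, should need an explicit sign-off",
    ),
]


def lint_migration(sql):
    """Return a list of (statement_number, rule_name, message) findings."""
    findings = []
    # Assumption: a naive split on ';' is good enough for a sketch.
    # It breaks on semicolons inside string literals or function bodies.
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    for number, statement in enumerate(statements, start=1):
        for name, pattern, message in RULES:
            if pattern.search(statement):
                findings.append((number, name, message))
    return findings


if __name__ == "__main__":
    migration = """
        CREATE INDEX idx_a ON t (a);
        CREATE INDEX CONCURRENTLY idx_b ON t (b);
        ALTER TABLE t DROP COLUMN c;
    """
    for number, name, message in lint_migration(migration):
        print(f"statement {number}: [{name}] {message}")
```

The point of running something like this in CI is that nobody has to remember the rule. The check fails the migration before it gets anywhere near production, and the message tells the engineer why.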

In the context of schema migrations, there are tools like pt-osc, gh-ost and pgroll that are excellent for what they are and I'm grateful to those who created them, but these are still in effect huge hacks that are bolted on top of the database. They're teeming with footguns. I can tell you about a time when badly conceived timeout / error handling code in one of these tools in their golang database driver resulted in most of a database being dropped instantly with no fast rollback, for example. That was fun.

If a third-party bolt-on tool can implement zero(ish) downtime migrations and cutovers, at least much much much better than the regular DDL, why the heck is the database not doing whatever that thing is doing? It's tech debt, history, crud, lack of caring, etc. Schema migrations are just the tip of the iceberg too, don't get me started on replication, durability settings, resource usage vs capacities ...

My point is that academics and vendors should be taking operational problems seriously and owning them, and solving them properly at the database layer.

And just like you I also make a living off of the fact that the real world operational characteristics of truly busy databases are very different from what gets benchmarked and tested. I'm just saying that some of this stuff even I don't want to have to deal with. There are better problems to be working on.