by mjb 1704 days ago
What happens when that database fails? Are you OK losing some data, or do you want the data to be synchronously replicated off the machine and be available somewhere else after failure? Distribution isn't only about scale, it's also about availability.

What happens when that database loses some data? Do you want an up-to-the-second backup, or point-in-time recovery? Or are you OK restoring last night's backup? Distribution isn't only about scale, it's also about durability.

What happens when you need to run an expensive business process ad-hoc? Do you want it to be easy to scale out reads, or to export that data to an analytics system? Or are you OK building something else to handle that case? Distribution isn't only about scale, it's also about flexibility.

What happens when you want to serve customers in one market, and make sure that their data stays local for regulatory compliance reasons or latency? Are you OK with having separate databases? Distribution isn't only about scale, it's also about locality.
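
To make the first question above concrete, here is a toy sketch of what "OK losing some data" versus "synchronously replicated" means at failover time. It is an in-memory model, not any real database's replication protocol; the names `Primary`, `Replica`, `ship_pending`, and `acknowledged_writes_lost_on_failover` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    log: list = field(default_factory=list)


@dataclass
class Primary:
    replica: Replica
    synchronous: bool
    log: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # acknowledged but not yet shipped

    def write(self, value):
        """Apply a write locally and acknowledge it to the client."""
        self.log.append(value)
        if self.synchronous:
            # Acknowledge only after the replica also has the write.
            self.replica.log.append(value)
        else:
            # Acknowledge immediately; the replica gets the write later.
            self.pending.append(value)
        return "ack"

    def ship_pending(self):
        """Background replication step used in asynchronous mode."""
        self.replica.log.extend(self.pending)
        self.pending.clear()


def acknowledged_writes_lost_on_failover(synchronous):
    """Return the acknowledged writes missing from the replica when the primary dies."""
    replica = Replica()
    primary = Primary(replica=replica, synchronous=synchronous)
    for i in range(3):
        primary.write(i)
    primary.ship_pending()  # replica catches up (a no-op in synchronous mode)
    for i in range(3, 5):
        primary.write(i)
    # The primary machine fails here, before the next ship_pending().
    acknowledged = list(primary.log)
    survived = replica.log  # what the promoted replica can still serve
    return [w for w in acknowledged if w not in survived]
```

In this model the asynchronous primary loses the two writes it acknowledged after the last shipping step, while the synchronous one loses none, at the cost of waiting on the replica for every write.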

2 comments

Fair points. I would argue that for most people a simple master-slave setup with manual failover will produce far fewer headaches than a "serverless" architecture.

When you are big enough to worry about the other issues, you surely are big enough to handle the requirements in-house. I see the dependence on some specific companies as the bigger threat to reliability.

The setup you describe is very much not simple. I worked at a place with very good DBAs and our replication setup caused us more downtime than anything else. Cockroach and Spanner exist because many programmers observed that what you describe is hard.
As a counter-anecdote: multiple startup projects I've worked on with separate MySQL setups where each had just a single master + two slaves (one warm for fast failover in case of hardware failure or upgrades, one cold for slow analytics-style queries) did just fine with millions (to tens of millions) of users. No downtime at all for years on end.

MySQL and Postgres are massively more widely used than Cockroach and Spanner, broadly very successfully. It's entirely feasible to run them with high uptime.

Very few deployments experience actual failures. Could be some fridge-door/light situation going on.
> fridge-door/light situation going on

What does it mean?

I think that is meant to be parsed as: Just like you can't check if the fridge light is on without opening the door (which of course turns it on), it's hard to know if a system is resilient to failure without having one. It just might be that there hasn't been a situation that would cause a failure.
This is probably one of the best motivations for a distributed database that I've read.

I find that it's not often that people grasp that distribution is about availability. It's obvious when you say it, but for a long time my own intuition was that distribution is mostly about durability, or about consensus protocols that provide total order across multiple machines. Yet these build together into availability.

In fact, I first noticed this distinction when reading Brian M. Oki's seminal 1988 paper on Viewstamped Replication, the work that would pioneer the field of consensus, a year before Paxos but with an intuitive protocol essentially identical to Raft. The surprising thing is that today many of us might have titled the paper something about "consensus" or "total order" (which it practically invented, and which was the major breakthrough, at least in showing how to do this in the presence of network partitions), but he titled it "Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems".

I did a short intro talk on Viewstamped Replication (and particularly on why FTP, nightly backups, or manual failover are not a solution): https://www.youtube.com/watch?v=_Jlikdtm4OA

The talk is followed by interviews with Brian M. Oki and James Cowling (authors of the 1988 and 2012 papers respectively).