Hacker News
by Darkstryder 2443 days ago
> Prematurely designing systems “for scale” is just another instance of premature optimization

> Examples abound: (...) using a distributed database when Postgres would do

This is the only part of the article that bugged me a little, because in my experience the choice between single-machine and distributed databases is not so much about scale as it is about availability and avoiding a single point of failure.

Even if your database server is fairly stable (a VM in a robust cloud, for instance), if you use Postgres or MySQL and you need to upgrade to a newer version of the database (let's say for an urgent security update), you have no choice but to completely stop the service for a few seconds or minutes (assuming the service cannot work without its database).

Depending on the service and its users, this mandatory down-time might or might not be acceptable.

Anecdotally I suspect services requiring high SLAs are more common than ones requiring petabyte scale storage.

Re availability: We had a hard time keeping the system based on Spark available. There were days when the cluster would freak out multiple times in a single day. The 'fix' would be: restart a bunch of Spark workers. We spent a lot of time debugging this (some parts documented in [1]) but couldn't work out what the problem was. (EDIT: Assuming there even was a single problem.)

In this particular case, I'd take the single point of failure over the previous situation.

That being said: we have successfully used PostgreSQL's fail-overs multiple times. In my experience, they work quite alright.

[1]: https://tech.channable.com/posts/2018-04-10-debugging-a-long...

Yeah, I agree. It was more of a general comment, because you seem to have one Postgres instance for every client, which is already a big step against SPOF.

At $previous_job we had a "one service" = "one MySQL instance" policy. Every time a MySQL server would go down all clients would all lose access to that service at the same time. It was stressful and much less robust than your setup.

High-availability Postgres setups are a minimum for a production system, and for a staging system too, so that you can understand how your system behaves during a failure event. These failure scenarios should be tested, not necessarily on every commit, but often enough that you are confident that during a failover you’re not going to drop queries on the floor and pretend it’s all good, and that your monitoring systems actually report on the event.

Yeah. I guess after having been bitten myself a few times by failed MySQL failovers, and especially after having read the GitHub October 2018 incident postmortem [1], I stopped considering failover solutions as a reliable availability solution altogether.

However this is just a personal opinion that I might revisit at some point.

[1] https://github.blog/2018-10-30-oct21-post-incident-analysis/

High-availability setups are also absolutely required to upgrade or patch running databases without significant downtime. Spending engineering and business time working around these issues is a 90s problem and has no place in a modern business environment. Heck, they figured out HA decades before then in commercial, proprietary DBs. Things are much more reliable now with OSS tools than even 4 years ago, to the extent that few talk about it anymore. There are definitely mistakes and bugs possible, but the number of _successful_ failover and failback events must be considered in the calculus.

Upgrades aren’t to be taken lightly of course but again, it’s now a cost of doing business and a reality that we need to engineer properly for.

I had the same reaction... this statement seems to be an over-generalization, but it can be resolved by being careful about what 'premature' means. In this case, "we can trivially shard datasets of different projects over different servers, because they are all independent of one another", so it seems the scaling issue had a solution from the outset.

Warnings about the risks of premature optimization should not stop people thinking about issues that are likely to arise in future, and what they might do about them. On the other hand, this does not mean that you should necessarily implement that solution now.

You can use a replication setup for Postgres, which gives you pretty good availability.

Also, Spark has some single points of failure that are not obvious at first.

https://gist.github.com/aseigneurin/3af6b228490a8deab519c6ae...

Downtime is not mandatory while upgrading to a new PostgreSQL version. I've upgraded from 9.0 to 11 and most versions in between without downtime on large busy databases.

There's not a single approach that works for every case, but they all involve a replica and upgrading one at a time.

You don't need a distributed database to have replication and a hot backup. Stack Overflow runs this kind of configuration -- if it works for them, it can work for you.

> Depending on the service and its users, this mandatory down-time might or might not be acceptable.

True. But if it is acceptable, then it may well be a good trade-off to make.

I agree completely. But it is a trade-off you need to be aware of, and need to be confident it will still be true in the future.

I've had situations where a client wanted a 99.95% availability SLA for our SaaS service instead of 99.5% as we were providing at the time.

He thought it was a minor change on our side, but it required completely rethinking our architecture from the ground up.
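
For a sense of why that is not a minor change, here is a rough sketch of the downtime budget each target implies. It assumes availability is measured over a 365-day year and that all downtime counts against the SLA; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day measurement window


def downtime_budget_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime allowed in the period without breaching the SLA."""
    return (1 - sla_percent / 100) * period_hours


for sla in (99.5, 99.95):
    print(f"{sla}% -> {downtime_budget_hours(sla):.2f} hours/year")
```

Under those assumptions, that is roughly 43.8 hours a year at 99.5% versus about 4.4 hours at 99.95%, so the budget shrinks tenfold.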

Yeah, makes sense. Conversely, our SLA is only 98%! (we actually have much better uptime than that and have only had about 1 or 2 hours of downtime in the last 6 months - but we have plenty of wiggle room if we need it)
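
As a back-of-the-envelope check on that wiggle room, here is a small sketch. It assumes "6 months" means roughly 182.5 days and takes the upper figure of 2 hours of downtime; the helper name is invented for this example.

```python
def observed_availability_percent(downtime_hours: float, period_hours: float) -> float:
    """Availability actually achieved over the period, as a percentage."""
    return 100 * (1 - downtime_hours / period_hours)


period = 182.5 * 24  # assumption: six months is about 182.5 days = 4380 hours
print(f"observed: {observed_availability_percent(2, period):.2f}%")
print(f"98% SLA budget: {0.02 * period:.1f} hours")
```

With those numbers, 2 hours of downtime works out to about 99.95% observed availability, against a budget of roughly 87.6 hours under a 98% SLA.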