Hacker News
by siganakis 2978 days ago
It's great to see they are so open about their issues, but as a paying customer it's a real shame how frequently they seem to suffer from outages and performance degradation.

I love the product, but hate that I can't rely on it.

1 comments

I share the sentiment. They do have some lessons to learn... One being "test how long fully restoring databases from catastrophic scenarios will take and consider if you can afford it to take that long; if not, take corrective action". People constantly seem to forget just how slow restoring or syncing a new database replica can get, or assume they'll never need to do it.
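To make the "can you afford it to take that long" question concrete, here is a rough back-of-the-envelope sketch. The function names and all figures are hypothetical, and the throughput should come from timing an actual restore drill rather than from a spec sheet.

```python
def restore_hours(dataset_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to copy dataset_gb at a sustained throughput_mb_s.

    Assumes 1 GB = 1000 MB and ignores index rebuilds, WAL replay and
    other post-copy work, so treat the result as a lower bound.
    """
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return dataset_gb * 1000 / throughput_mb_s / 3600


def within_budget(dataset_gb: float, throughput_mb_s: float,
                  budget_hours: float) -> bool:
    """True if the estimated restore fits inside the downtime you can afford."""
    return restore_hours(dataset_gb, throughput_mb_s) <= budget_hours


if __name__ == "__main__":
    # Hypothetical example: 360 GB restored at a measured 50 MB/s,
    # against a 1 hour downtime budget.
    print(restore_hours(360, 50))     # estimated hours
    print(within_budget(360, 50, 1))  # does it fit the budget?
```

If the estimate blows the budget, that is the cue for corrective action: a faster link or disks, a smaller dataset, or a standby that is already synced.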
Well, you can't really solve the high availability problem by testing how long replication and restoration take. It's a lot more nuanced than that and requires real expertise in distributed systems. Usually it means getting rid of PostgreSQL/MySQL completely in favor of distributed solutions, as that's cheaper and a better investment in the infrastructure than attempting to build high availability on top of them.
You can get high-enough availability just fine on Postgres. Very few applications require zero downtime. With pgbouncer or similar in front, you can generally flip to a slave with very minimal impact. The issue comes in situations like this one, where a mistake leaves you without up-to-date slaves and your system can't handle the read load on a single server.
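A quick way to check whether you're exposed to that "single server can't carry the reads" situation is a capacity calculation like the sketch below. The function name and the numbers are made up for illustration; per-server capacity has to come from your own load tests.

```python
import math


def survivable_failures(peak_read_qps: float, per_server_qps: float,
                        servers: int) -> int:
    """How many servers can be lost while the rest still carry peak reads.

    A negative result means even the full set is under-provisioned.
    Assumes reads spread evenly and every server has the same capacity.
    """
    if per_server_qps <= 0:
        raise ValueError("per-server capacity must be positive")
    needed = math.ceil(peak_read_qps / per_server_qps)
    return servers - needed


if __name__ == "__main__":
    # Hypothetical: 9000 read qps at peak, each server handles 5000 qps.
    print(survivable_failures(9000, 5000, 3))  # primary plus two replicas
    print(survivable_failures(9000, 5000, 1))  # only the primary left
```

With those made-up numbers, three servers tolerate losing one, while a lone primary is already short by a server, which is the position you end up in when the replicas are gone or stale.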

I agree with you in principle, but for most systems it's total overkill. It wouldn't be total overkill if distributed solutions were easy to set up and without tradeoffs, but we're nowhere near being there.

In most cases then, restoration time is the biggest barrier to getting "high-enough" availability without re-engineering everything for a totally different system. Often you can prevent that from becoming an issue by siloing functionality into separate databases, offloading logs and analytics for example. Or buying faster SSDs for your DB servers... There are many approaches depending on the size of your dataset, and most people never outgrow those options.

To put it this way: Gitlab.com's database is small enough that fitting it in RAM on a commodity server is easily doable. While they'd still need to have snapshots on disk, at that point beating the restore speeds they're reporting would be trivial.