| It amazes me how much better MySQL has been in this regard for at least a decade, and it's also amazing that it's still not that well known today. Back in 2015 I worked at a fast-growing unicorn that had badly implemented basically everything, because they started with a tiny ops team of grads and developers. Very little was being monitored, and only a handful of metrics were being graphed (mostly network stuff in Cacti). Our DB issues were all caused by stupid stuff: * a hard disk in an array fails undetected * the battery in an array controller fails * a disk fills up * dubious backups, with no point-in-time recovery * extremely poorly written SQL queries * poorly configured MySQL (in oh-so-many ways) The top three (at least) would ultimately cause replication lag, which would eventually trigger an alert. ... And yet we never lost a cluster. (And we had a lot of them!) My team sweated blood improving processes and tooling, and then I spent a 6-month stint on database clusters (switching to GTID-based replication and rewriting the ops config code so that the clusters were all consistently configured and monitored). Occasionally a new senior hire would insist that PostgreSQL was a necessity, so we'd stand back and let them produce a proof of concept that stood up to the types of failures our MySQL clusters dealt with regularly, without waking oncall up at night. And it was always a bit of a joke by comparison. |