| I'm in the fortunate position of having been able to tell our story in detail on our blog after a major outage involving Cassandra and bootstrap behaviour that we didn't fully understand. This is a story of how I bought down the bank for two hours. https://monzo.com/blog/2019/09/08/why-monzo-wasnt-working-on... In summary, we were scaling up our production Cassandra data store and we didn't migrate/backfill the data properly which led to data being 'missing' for an hour. In a typical Cassandra cluster when scaled up, data moves around the ring a single node at a time. When you want to add multiple nodes, this can be an extremely time and bandwidth consuming process. There's a flag called auto_bootstrap which controls this behaviour. Our understood behaviour was that it will not join the cluster until operators explicitly signal for it to so (and this is a valid scenario because as an operator, you can potentially backfill data from backups for example). Unfortunately it was completely misunderstood when we originally changed the defaults many months prior to the scale up. Fortunately, we were able to detect data inconsistency within minutes of the original scale up and we were able to fully revert the status of the ring to it's original state within 2 hours (it took that long because we did not want to lose any new writes so we had to carefully remove nodes in the reverse order that they came in and joined the ring). Through a mammoth effort across the engineering team across two days, we were able to reconcile the vast majority of inconsistent data through the use of audit events. This was a mega stressful day for everyone involved. On the plus side though, I've had a few emails telling me that the blog post has saved others from making a similar mistake. |
Well, one of the junior devs just so happened to be playing around with various different Solr queries to see what he could get back and somehow issued a query that caused the entire staging database to fall over. That was a fun phone call to get. It wasn’t the junior dev’s fault, of course, but it really did wonders to expose the fragility of poorly optimized/unindexed queries against the database.
My experience in general with Cassandra is that outside of a few experts working with it, it was pretty poorly understood throughout the org and no one except those select people could really do anything when it all fell over.