Hacker News new | ask | show | jobs
by suhailpatel 1816 days ago
I'm in the fortunate position of having been able to tell our story in detail on our blog after a major outage involving Cassandra and bootstrap behaviour that we didn't fully understand. This is a story of how I bought down the bank for two hours.

https://monzo.com/blog/2019/09/08/why-monzo-wasnt-working-on...

In summary, we were scaling up our production Cassandra data store and we didn't migrate/backfill the data properly which led to data being 'missing' for an hour.

In a typical Cassandra cluster when scaled up, data moves around the ring a single node at a time. When you want to add multiple nodes, this can be an extremely time and bandwidth consuming process. There's a flag called auto_bootstrap which controls this behaviour. Our understood behaviour was that it will not join the cluster until operators explicitly signal for it to so (and this is a valid scenario because as an operator, you can potentially backfill data from backups for example). Unfortunately it was completely misunderstood when we originally changed the defaults many months prior to the scale up.

Fortunately, we were able to detect data inconsistency within minutes of the original scale up and we were able to fully revert the status of the ring to it's original state within 2 hours (it took that long because we did not want to lose any new writes so we had to carefully remove nodes in the reverse order that they came in and joined the ring).

Through a mammoth effort across the engineering team across two days, we were able to reconcile the vast majority of inconsistent data through the use of audit events.

This was a mega stressful day for everyone involved. On the plus side though, I've had a few emails telling me that the blog post has saved others from making a similar mistake.

2 comments

I have a Cassandra story as well. At a previous employer our org used a database that was a custom wrapper around Cassandra. This was a fairly large organization and this particular database was the keystone to the vast majority of operations of this particular organization. Well, one day I was giving a demo to some junior devs on how to use the REST API for the database which just so happened to take in raw Solr queries. I always liked to point that out to the newer devs as a way they could do some nice things that were otherwise fairly limited by the REST API.

Well, one of the junior devs just so happened to be playing around with various different Solr queries to see what he could get back and somehow issued a query that caused the entire staging database to fall over. That was a fun phone call to get. It wasn’t the junior dev’s fault, of course, but it really did wonders to expose the fragility of poorly optimized/unindexed queries against the database.

My experience in general with Cassandra is that outside of a few experts working with it, it was pretty poorly understood throughout the org and no one except those select people could really do anything when it all fell over.

After having spent many years working with it and interacting with it deeply, I would strongly recommend folks stay far away from Cassandra if you remotely care about your data. It provides way too many footguns to lose or corrupt or outright ruin your data.

Unless you work at Apple or Netflix or Spotify, finding Cassandra experts is going to be nigh on impossible and the community just isn't there unfortunately.

That’s a great writeup, thanks for all the detail!

I was always worried about something like this happening so only ever provisioned (via ansible) one server at a time. When the logs showed it was fully synced, we provisioned the next node. It could take two days to add 10 nodes but I always felt much safer

On the cloud, it is likely simpler and faster to just spin up a new cassandra datacenter, and then do a rebuild from the old datacenter to the new datacenter, either all nodes at once in parallel or in smaller batches. This procedure works fine regardless of using static tokens allocation or vnodes, and adds very little load to the old datacenter which is still serving traffic.
This is the standard approach and the one we have detailed runbooks for. We've scaled the cluster fine one at a time after this experience. It also prompted us to get a much better understanding of all the other flags that have been changed beyond the defaults.