Hacker News
by t90fan 1060 days ago
Running regular incremental repairs is the norm, as nodes will from time to time have trouble talking to each other for real-world network reasons, or will go down for things like OS patching. We had a (daily) cron job for it. I come from the software side, not the DBA side of things, but my main advice from running Cassandra at scale in production (it was part of an Apigee stack) is: basically, don't! It was not very reliable, would consume huge volumes of memory (especially during repairs), bandwidth (a repair is very chatty, as it has to sync lots of data) and disk space (tombstoning meant deleted records took up space until compaction ran), and was generally not much fun to manage. It was also difficult to hire people who knew much about it. I would not build a solution myself using it going forward. We also had to periodically (weekly) do "full" repairs to work around Cassandra bugs, silent data corruption, etc.

We (as in, my company, not me myself) run large Cassandra clusters in the critical path of bank transaction processing (in the order of 2-25 million payments per day, each requiring a lot of database queries) and it's going pretty well...

https://www.youtube.com/watch?v=0QsLU9na2uE

But yes, you win some (mainly resilience, availability and disaster avoidance; tunable consistency may help you too), you lose some.

To do 2-25 million transactions per day you might as well use SQLite. Sounds like this was a career development push more than anything.
Transaction in this case is not a database transaction, but a financial transaction (payment). Per payment, probably somewhere in the order of 50-100 database transactions (although Cassandra does not really have transactions of course, interpret this as read/write actions) will be performed in the course of its processing. So that is 1,875,000,000,000 database actions on busy days. Not a DBA, but for our purposes the scalability and availability of Cassandra works very well.
There are three extra zeros here
Whoops you are right.
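For reference, the corrected arithmetic can be sketched as below. This is only a back-of-the-envelope check, assuming the busy-day peak of 25 million payments and the rough 50-100 read/write actions per payment quoted above; the variable names are illustrative.

```python
# Back-of-the-envelope check using the figures quoted in the thread.
payments_per_day = 25_000_000      # busy-day peak mentioned above
actions_per_payment = (50, 100)    # rough range of reads/writes per payment

low = payments_per_day * actions_per_payment[0]
high = payments_per_day * actions_per_payment[1]
midpoint = (low + high) // 2       # corresponds to ~75 actions per payment

print(f"{low:,} to {high:,} actions/day, midpoint {midpoint:,}")
```

So the busy-day figure is on the order of 1.9 billion database actions, not 1.9 trillion.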
There is always a better solution than Cassandra, until your data no longer fits on a single server or you actually need guaranteed availability.
Is there no other db that offers guaranteed availability?

I remember reading about Discord switching from Cassandra to ScyllaDB I think.

Yes, other distributed databases like MongoDB, CockroachDB, probably a few others. Or even multi-master DB setups. As with Cassandra, you don't want to use them unless you really need that availability and can suffer the downsides. It seems pretty rare to actually need those availability guarantees, rather than, say, a robust fast-failover setup which might cancel some in-flight transactions. It is probably when you start looking at two-phase commit that you look for alternatives with better availability stories.
ScyllaDB is just a knock-off Cassandra with different features and performance characteristics.
25 million per day might be a little high for SQLite, especially if they don't spread out evenly over the day. You also get no redundancy or replication, which you might want if your database isn't to grind to a halt during backups.

Arguably Cassandra does sound like a weird choice, but we don't know the specifics of their setup. There are a lot of solutions presented on HN where SQLite and a Java application would have been a better choice, and you can say so for sure without knowing all the details. I feel like this is past that point.

No need to add the extra bit - at the top end, 289 transactions per second is probably not something you'd want to choose SQLite for, but PG/MySQL/SQL Server would do that fine and require a lot less feeding (though any database with traffic or size needs some care).
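The 289 per second figure is the 25 million peak averaged over a full day. A rough sketch of that calculation follows, assuming payments are spread perfectly evenly (which, as noted earlier in the thread, they probably are not) and reusing the 50-100 database actions per payment estimate from above; all names are illustrative.

```python
# Average rate implied by the busy-day peak, assuming a perfectly even spread.
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400

payments_per_day = 25_000_000
payments_per_second = payments_per_day / SECONDS_PER_DAY

# Each payment was estimated at roughly 50-100 reads/writes.
db_actions_per_second = (payments_per_second * 50, payments_per_second * 100)

print(f"{payments_per_second:.0f} payments/s on average")
print(f"{db_actions_per_second[0]:,.0f} to {db_actions_per_second[1]:,.0f} database actions/s")
```

Note that 289 is payments per second; the database itself would see something in the range of 14,000 to 29,000 actions per second on average, and more at intraday peaks.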
We run a Cassandra cluster in production and it's a pretty small cluster, yet everything you mentioned resonates. We do use Cassandra Reaper to automate some of the tasks, but no one on the team wants to touch Cassandra in general.
Thanks for sharing your experience -- I know I've spent a lot of time in the past worrying about FS corruption, but generally expecting that the database sitting on top of it should never get corrupted, mostly because I use postgres so much.

I don't have the experience you do in this situation, but my first reaction to this was definitely "don't use Cassandra". But I also never really understood the use case where Cassandra shines as a solution either (it seems like only companies with a lot of data really get wins from it?)

Can recommend https://cassandra-reaper.io/ for most of the management stuff you're mentioning. Still not free though, running Cassandra requires (some) effort in my experience.
Scaling up also takes up a lot of resources so you're never able to scale up in response to load without hosing your database even more.