|
|
|
|
|
by hibikir
1050 days ago
|
|
I have had to make that Clustered RabbitMQ to Kafka move myself, as the failure modes from RabbitMQ we're very scary. The most scary thing in the entire infrastructure in that financial institution levels of scary. It's not that it failed much, but you don't need many middle of the night calls with no good SOP to get the cluster back to health before migrating is in the cards. Kafka is not operationally cheap. You probably want a person or two that understands how JVMs works, which might be something you already have plenty of, or an unfortunate proposition. But it does what is on the tin. And when you are running fleets of 3+ digits worth of instances, very few things are more important. |
|
A distributed database will have network failures, will have conflicting writes, will have to either pick between being down if any of the network is down (CP) or you need a "hard/complex" scheme for resolving conflicts (AP). Cassandra has tombstones, cell timestamps, compaction, repair, and other annoying things. Others databases use vector clocks which is more complex and space intensive than the cell timestamps.
It's tiring to have move fast break things attitudes applied to databases. Yeah, sure your first year of your startup can have that. But your database is the first thing to formalize, because your data is your users/customers, you lose your data, you lose your users/customers. And sorry, but scaling data is hard, it's not a one or two sprint "investigate and implement". In fact, if you do that, unless you are doing a database the team has years of former experience with in admin and performance, you are doing it wrong.
"AWS/SaaS will eliminate it for me"
Hahahahaha. No it won't. It will make you life easier, but AWS DOESN'T KNOW YOUR DATA. So if something is corrupted or wrong or there is a failure, AWS might have more of the recovery options turnkeyed for you, but it doesn't know how to validate the success for your organization. It is blind trust.
AWS can provide metrics (at a cost), but it doesn't know performance or history. You will still need, if you data and volumes are any scale, how to analyze, replicate, performance test, and optimize your usage.
And here's a fun story, AWS sold its RDS as "zero downtime upgrades". Four or five years later, a major version upgrade was forced by AWS .... but it wasn't zero downtime. Yeah, it was an hour or so and they automated it as much as they could. But it was a lie. And AWS forced the upgrade, you had no choice in the matter.
Most clustering vendors don't advertise (or don't even know) what happens in the edge cases where a network failure occurs in the cluster but the writes don't propagate in the "grey state" to all nodes. Then the cluster is in a conflicted write state. What's the recovery? If you say "rerun the commit log on the out of sync nodes" you don't understand the problem, because deletes are a huge wrench in the gears of that assumption.
From my understanding of Cassandra, which kafka appears from the numerous times I've looked to be similar too with quorums and the like, it's built on a lot of the partition resilient techniques.
And, kafka has undergone Jepsen: https://aphyr.com/posts/293-jepsen-kafka
For those that don't know, aphyr will embarrass any distributed system given enough time. What is important is that
1) the distributed system is willing to subject itself to him and
2) they have a satisfactory response.
For an example of an unsatisfactory response, I give you MongoDB:
https://jepsen.io/analyses/mongodb-4.2.6
Note the "updates" section doesn't actually have them retry/repeat the testing. MongoDB just ran from the report. They claim it was fixed.
Anyway, if a system doesn't do that (submit to jepsen testing), then IMO it is hiding some big big big red flags.