by saryant 3368 days ago
We've had a Kafka system in production for maybe 7 years that deals with this problem.

With our use case, we can have unpredictable spikes in volume, all of which we must consume. Those spikes can be an order of magnitude larger than our baseline average. We put Kafka topics between every stage of our processing pipeline and configure the various Kafka clusters to be:

1) Huge. Seriously, way overkill.

2) Able to sustain triple the largest spike we've ever seen without expiring data (size-based expiry)
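
To make the sizing rule concrete, here is a rough sketch of the arithmetic. Kafka's `retention.bytes` setting is enforced per partition, so the total has to be divided across partitions. The function name, the even-spread assumption, and the example numbers are mine for illustration, not figures from our setup.

```python
def per_partition_retention_bytes(largest_spike_bytes: int,
                                  partitions: int,
                                  safety_factor: int = 3) -> int:
    """Size-based retention needed per partition to hold a multiple of the
    largest spike without expiring data.

    Assumes (hypothetically) that data is spread evenly across partitions.
    """
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    total = largest_spike_bytes * safety_factor
    # Integer ceiling division, so we never round the requirement down.
    return -(-total // partitions)


# Example: a 600 GB spike, 60 partitions, triple headroom.
needed = per_partition_retention_bytes(600 * 10**9, 60)
```

The result is the value you would compare against the topic's `retention.bytes`; a skewed partition key would need more headroom than this even-spread estimate.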

Since most of our processing stages are essentially consuming, transforming and publishing back to Kafka, we've written them to not ack a message until the result of that stage has been safely published to the next Kafka topic. We require acks from all in-sync replicas. Since the subscriber part of a processing stage doesn't ack until its producer side has received acks from all ISRs, we're pretty confident in our data fidelity. In fact we have other infrastructure that verifies that everything coming out of this Kafka chain is correct and full-fidelity, so we know for certain that this setup can withstand huge spikes in volume without any load shedding.
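
A minimal sketch of that ack ordering, not our actual code. The stage is written against three plain callables so it runs without a broker; `run_stage`, `PublishFailed`, and the callable names are hypothetical. The two config dicts use the standard Kafka client setting names (`acks`, `enable.auto.commit`), but which client library you pass them to is left open.

```python
from typing import Callable, Optional, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")

# Standard Kafka client setting names; the values reflect the policy above.
PRODUCER_CONFIG = {"acks": "all"}                # wait for every in-sync replica
CONSUMER_CONFIG = {"enable.auto.commit": False}  # offsets are committed by hand


class PublishFailed(Exception):
    """Hypothetical error: the downstream write was not acknowledged."""


def run_stage(poll: Callable[[], Optional[In]],
              publish: Callable[[Out], None],
              commit: Callable[[In], None],
              transform: Callable[[In], Out]) -> int:
    """Consume, transform, publish, and only then ack the input.

    poll()      returns the next input message, or None when none is left.
    publish(v)  blocks until the write is acked; raises PublishFailed if not.
    commit(msg) acks the input message to the upstream topic.
    Returns the number of messages fully handled.
    """
    handled = 0
    while True:
        msg = poll()
        if msg is None:
            return handled
        publish(transform(msg))  # if this raises, the input stays un-acked
        commit(msg)              # reached only after the downstream ack
        handled += 1
```

If `publish` fails, the exception propagates and the input message is never committed, so it is redelivered after a restart. That gives at-least-once delivery: duplicates are possible downstream, but nothing is dropped.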

And then we run all of it redundantly in multiple AWS availability zones just to be sure.

If any stage in our processing pipeline cannot keep up with increased volume, that's fine; it'll catch up eventually because we know that our retention policies are sufficient. And since (almost) every stage is run redundantly, even if one instance somewhere does become slow (or goes down), the redundant pipeline will keep data flowing so we generally have no customer impact. In fact if that does happen but the system as a whole keeps up, we don't even consider it a pageable event. If a machine falls over at 3am but its redundant cousin keeps up, we'll fix it the next day during business hours.
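
The paging rule can be sketched roughly like this. The function name, the side labels, and the lag threshold are made up for illustration; the point is only that one lagging or dead side is not an alert as long as another side keeps up.

```python
from typing import Mapping


def should_page(lag_by_side: Mapping[str, int], max_lag: int) -> bool:
    """Page only when no redundant side of the pipeline is keeping up.

    lag_by_side maps a pipeline side (hypothetical labels such as "az-a")
    to its consumer lag in messages. A side that is down can be reported
    with a very large lag.
    """
    if not lag_by_side:
        return True  # nothing reporting at all is treated as an outage
    return all(lag > max_lag for lag in lag_by_side.values())


# One side is far behind, the other is healthy: fix it in business hours.
quiet_night = should_page({"az-a": 5_000_000, "az-b": 120}, max_lag=10_000)
```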

(Redundant pipelines are also great for deploys—take down an entire side while you redeploy and you've now got zero-downtime deployments)