|
The main struggle I have with Kafka is managing partitioning to avoid hotspots. I have a scenario where we have hundreds of installs of an old and shitty RDBMS on customer sites, and we need to replicate changes to data to a central store. We had to come up with a bespoke event system that would capture insert, update, and delete events, ship them to a REST endpoint, who would then throw them into Kafka to be processed into the central store. Kafka’s ordered message log made it ideal for this scenario, as we can’t play events out of order (although because of poor design in the old databases, it sometimes happens and we built a retry system using additional Kafka topics, nonetheless avoiding out of order messages is critical to keep consumer lag under control). This works mostly ok, but we have a problem when individual customers have big bursts of traffic. Ultimately, we need records to be processed in the order they happen, per customer. Naively, we could partition by customer ID, but arbitrarily adding new partitions as we add customers is not practical over time, and regardless, bulk inserts, updates, etc. could cause large amounts of latency for a customer. So, we’re doing a balancing act of trying to partition using customer ID + a “bundle name” of related tables (the net effect being activity to dependent tables for the same customer always go to the same partition and thus process in order). We’re also looking at using additional topics to create high, medium, and low priority queues, but while that may smooth out some of the problems, it really only breaks the original problem into three smaller versions of the same problem, effectively kicking the can down the road. Ultimately, the best solution would be to get rid of the crappy RDBMS and replace with something that we can binlog or otherwise sync transactionally rather than record by record. We are working on this, but it’s slow going. In the meantime, we continue to wrestle with Kafka partitioning woes. As an aside, we also got rid of Avro. It just didn’t have any benefits that outweighed the challenges to get it and keep it working over time. Much easier to just use plain json, a common message class library between consumers and producers, and a fast, traditional json library. I’ll fully admit that perhaps the avro woes are more an issue inexperience, but I seem to find more people who have the same experience as me than not. Either way, plain json has not caused us any problems. |