| Oh, I’ve seen much worse than this. I truly believe system design interviews and Confluent marketing/sales have made Kafka a midwit trap: 1. You cannot just use Kafka for free. It will take dev time to set up Kafka itself, dev time to code sources and sinks, and dev time to handle commonly glossed-over but utterly important details like idempotency, retries, duplicate messages, consumed-but-not-committed messages (or whatever the term is in Kafka world), and network interruptions or restarts of consumers. 2. You cannot just continue to use Kafka for free. Running it has an operational cost; this is mitigated by using it as a PaaS, but not fully solved, as you’ll still need to twiddle configurations and scramble to deal with things like “We had no idea we’d need to handle idempotency or use dead-letter queues; we fixed it, but now need to deal with old data from before our fix”. 3. There are many ways to implement async producer/consumer patterns that are less complex and less costly than Kafka. For example, you can write data to S3. Or you can store records in a regular RDBMS. Kafka isn’t worth it unless you really need “real-time” ingestion but can’t/won’t/shouldn’t implement an actually real-time (synchronous) system instead, for example if you get large spikes and are OK with ingestion latency going from O(seconds) to O(minutes) when that happens. 4. There’s a good chance you don’t need an async queue at all. If consumers can horizontally scale quickly (like with Lambda), why not synchronously invoke them over HTTP/RPC and only use async queues (or a file, etc.) for messages that fail multiple retries? Since external users usually aren’t writing directly to your Kafka topic, and thus you have a degree of control over your ingestion and consumption, why not just combine the two services? You can experience data loss between the external world and the ingestion service anyway, and in fact this is a pretty likely source of failures, so your queue may not even be solving the problem you think it is. 
If you don’t need ordering or partitioning and are using queues for, e.g., config update propagation, why not just synchronously update consumers or implement basic polling in your consumers? 5. A lot of Kafka/Confluent “features” like retries or logging are available from many other tools and services, but for some reason these can be the actual selling point, more so than the fact that it’s an async queue (that also has these features). Yes, in a FAANG design interview where it only costs you 5 seconds to say “and we’ll use an async queue like Kafka between these components to handle variable load and partially consumed data”, it’s a great tool that saves you a lot of time. And when you pretend integration and maintenance costs don’t exist, and don’t even know what idempotency means, and are sitting across from some slick Confluent salesperson telling you Kafka can be THE database of everything your company does with all these nice features, it sounds great to midwit managers and has-been architecture astronauts. In reality? The dumb, unsexy alternatives probably solve your actual problem more simply. |
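To make point 4 above concrete (invoke the consumer synchronously, and only park messages that fail several retries), here is a minimal Python sketch. Everything in it is an assumption for illustration: `deliver`, `handler`, and `dead_letter_path` are hypothetical names, the "dead-letter queue" is just a local JSON-lines file, and a real service would likely catch narrower exceptions and add jitter to the backoff.

```python
import json
import time
from typing import Callable


def deliver(
    message: dict,
    handler: Callable[[dict], None],
    dead_letter_path: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> bool:
    """Call the consumer synchronously; park the message in a file if every attempt fails.

    `handler` stands in for an HTTP/RPC call to the consumer (hypothetical).
    Returns True if the handler succeeded, False if the message was dead-lettered.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception:
            # Back off exponentially between attempts, but not after the last one.
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))

    # All attempts failed: append the message to a JSON-lines file for later replay.
    with open(dead_letter_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(message) + "\n")
    return False
```

Because a retry can re-run a handler that partly succeeded, the handler still has to be idempotent; dropping the queue does not make that requirement go away.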
Your suggestion of a single-node RDBMS as a replacement for Kafka suggests you don't have experience with workloads that cannot be served by a single machine but still need a single architecture.
Agree that Confluent took a great product that works for high-load use cases and then tried to shove it into each and every average Fortune 1000 enterprise use case with 100 users and traffic that could be well served by SQLite/Postgres on a single machine.
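For the small-traffic case, the "records in a regular RDBMS" approach can be sketched in a few lines of Python with SQLite. This is an illustration under stated assumptions, not a recommendation for any particular workload: the `jobs` table and the `open_queue`, `enqueue`, and `process_next` helpers are hypothetical, there is a single consumer process, and delivery is at-least-once, so the handler must tolerate reruns.

```python
import sqlite3
from typing import Callable


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a single-table queue. The schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id      TEXT PRIMARY KEY,           -- producer-supplied id deduplicates enqueues
               payload TEXT NOT NULL,
               done    INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def enqueue(conn: sqlite3.Connection, msg_id: str, payload: str) -> bool:
    """Insert a message; a repeated id is ignored. Returns True if a row was added."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO jobs (id, payload) VALUES (?, ?)",
            (msg_id, payload),
        )
    return cur.rowcount == 1


def process_next(conn: sqlite3.Connection, handler: Callable[[str], None]) -> bool:
    """Run the handler on the oldest pending message. Returns False if none is pending."""
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE done = 0 ORDER BY rowid LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    msg_id, payload = row
    handler(payload)  # if this raises, the row stays pending and is retried later
    with conn:
        conn.execute("UPDATE jobs SET done = 1 WHERE id = ?", (msg_id,))
    return True
```

A consumer would call `process_next` in a polling loop. With several competing consumers you would need row claiming (for example `SELECT ... FOR UPDATE SKIP LOCKED` in Postgres), which this sketch leaves out.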