by opportune 1004 days ago
Oh, I’ve seen much worse than this. I truly believe system design interviews and Confluent marketing/sales have made Kafka a midwit trap:

1. You cannot just use Kafka for free. It takes dev time to set up Kafka itself, dev time to code sources and sinks, and dev time to handle commonly glossed-over but utterly important details like idempotency, retries, duplicate messages, consumed-but-not-committed messages (or whatever the term is in Kafka world), network interruptions, and consumer restarts.

2. You cannot just continue to use Kafka for free. Running it has an operational cost; this is mitigated by using it as a PaaS, but not fully solved, as you’ll still need to twiddle configurations and scramble to deal with things like “We had no idea we’d need to handle idempotency or use dead-letter queues, fixed it, but now need to deal with old data from before our fix”.

3. There are many ways you can implement async producer:consumer patterns that are less complex and less costly than Kafka. For example, you can write data to S3. Or you can store records in a regular RDBMS. Kafka isn’t worth it unless you really need “real-time” ingestion but can’t/won’t/shouldn’t implement an actually real-time (synchronous) system instead, like if you get large spikes and are ok with ingestion latency going from O(seconds) to O(minutes) when that happens.

4. There’s a good chance you don’t need an async queue at all. If consumers can horizontally scale quickly (like with Lambda), why not synchronously invoke them over HTTP/RPC and only use async queues (or a file, etc.) for messages that fail multiple retries? Since external users usually aren’t directly writing to your Kafka topic, and thus you have a degree of control over your ingestion and consumption, why not just combine the two services? You can experience data loss between the external world and the ingestion service anyway, and in fact this is a pretty likely source of failures, so your queue may not even be solving the problem you think it is. If you don’t need ordering or partitioning and are using queues for e.g. config update propagation, why not just synchronously update consumers or implement basic polling in your consumers?

5. A lot of Kafka/Confluent “features” like retries or logging are available from so many other tools and services, but for some reason these can be the actual selling point, more so than the fact that it’s an async queue (that also has these features).
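
To make point 4 concrete, here is a rough sketch of "invoke synchronously, and only park the message somewhere if it keeps failing". Everything here is an assumption for illustration: `dispatch`, the `handler` callable (standing in for an HTTP/RPC call to a consumer), and the JSON-lines dead-letter file are hypothetical names, not anything from Kafka or a specific framework.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable


def dispatch(message: dict, handler: Callable[[dict], None], dead_letter: Path,
             max_attempts: int = 3, backoff_s: float = 0.0) -> bool:
    """Call handler synchronously; park the message in a file if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)  # stand-in for a synchronous HTTP/RPC invocation
            return True
        except Exception:
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)  # linear backoff between attempts
    # Every attempt failed: append the message as one JSON line for later replay.
    with dead_letter.open("a", encoding="utf-8") as f:
        f.write(json.dumps(message) + "\n")
    return False


# Hypothetical handlers: one recovers on the third call, one never recovers.
calls = {"n": 0}


def flaky(msg: dict) -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")


def broken(msg: dict) -> None:
    raise RuntimeError("always fails")


dlq = Path(tempfile.mkdtemp()) / "dead_letter.jsonl"
ok = dispatch({"id": 1}, flaky, dlq)       # succeeds after two retries
parked = dispatch({"id": 2}, broken, dlq)  # ends up in the dead-letter file
```

The handler still has to be idempotent, since a retry can repeat work that partially succeeded. That requirement does not go away with Kafka either.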

Yes, in a FAANG design interview where it only costs you 5 seconds to say “and we’ll use an async queue like Kafka between these components to handle variable load and partially consumed data” it’s a great tool that saves you a lot of time. And when you pretend integration and maintenance costs don’t exist, and don’t even know what idempotency means, and are sitting across from some slick Confluent salesperson telling you Kafka can be THE database of everything your company does with all these nice features, it sounds great to midwit managers and has-been architecture astronauts. In reality? The dumb unsexy alternatives probably solve your actual problem more simply.
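
As one example of a dumb unsexy alternative, a plain RDBMS table can serve as the queue. This is a minimal sketch under loud assumptions: SQLite stands in for "a regular RDBMS", there is a single consumer, and the table and function names (`queue`, `enqueue`, `consume`) are made up for illustration. The primary key on `msg_id` is what drops duplicate messages.

```python
import sqlite3
from typing import Callable


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        " msg_id TEXT PRIMARY KEY,"          # producer-supplied id, used for dedup
        " body TEXT NOT NULL,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    db.commit()
    return db


def enqueue(db: sqlite3.Connection, msg_id: str, body: str) -> bool:
    """Insert a message; return False if this msg_id was already enqueued."""
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "INSERT OR IGNORE INTO queue (msg_id, body) VALUES (?, ?)",
            (msg_id, body),
        )
    return cur.rowcount == 1


def consume(db: sqlite3.Connection, handler: Callable[[str, str], None],
            batch: int = 10) -> int:
    """Process pending messages in insertion order; mark each done after handling."""
    rows = db.execute(
        "SELECT msg_id, body FROM queue WHERE done = 0 ORDER BY rowid LIMIT ?",
        (batch,),
    ).fetchall()
    for msg_id, body in rows:
        handler(msg_id, body)
        # A crash between handler() and this UPDATE causes redelivery
        # (at-least-once), so the handler itself should be idempotent.
        with db:
            db.execute("UPDATE queue SET done = 1 WHERE msg_id = ?", (msg_id,))
    return len(rows)


db = open_queue()
first = enqueue(db, "a", "hello")
dup = enqueue(db, "a", "hello")  # duplicate delivery from the producer is ignored
enqueue(db, "b", "world")

seen = []
n1 = consume(db, lambda msg_id, body: seen.append((msg_id, body)))
n2 = consume(db, lambda msg_id, body: seen.append((msg_id, body)))  # nothing left
```

Multiple concurrent consumers would need row claiming or locking on top of this, which is where a server RDBMS would be a better fit than SQLite.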

1 comments

from your reply it seems like you have not worked with high throughput workloads that FAANG deals with daily.

your suggestion of a single-node rdbms as a replacement for kafka suggests you don't have experience with workloads that cannot be served by a single machine, yet still need a single architecture.

agree that Confluent took a great product that works for high-load use cases, and then tried to shove it into each and every average Fortune 1000 enterprise use case with 100 users and traffic that could be well served by SQLite/Postgres on a single machine

Not only have I worked on some of the highest-throughput workloads in FAANG, the ones I’ve worked on involve consuming from async queues (including Kafka) as a first-class feature and use additional queueing under the hood to make the system work.

Kafka is overkill for the vast majority of users, including many in FAANG, and at the FAANG companies I’ve been at, usually either an internal async queue or a bespoke solution for the problem is used instead. But more importantly, async queueing via Kafka-like technology itself is way over-applied. Polling a single-node RDBMS with a horizontally scaling cache in front of it is a fine way to propagate e.g. config changes at massive scale. And many of the “real-time ingestion” use cases that people love using Kafka for are much better off simply being synchronous, with engineering effort instead focused on the ability to rapidly scale consumers up and down.