by twa927 2590 days ago
Can someone provide actual high-level use cases for using Kafka? Preferably use cases not handled by RabbitMQ.

I've seen a few talks about Kafka but they focused on the internals. My guess is that Kafka is for large systems for which managing a multi-node RabbitMQ cluster is too much trouble.

5 comments

I’ve long had the inverse view - I’m not sure what good use cases there are for RabbitMQ that couldn’t be handled better by a Kafka cluster.

One company I worked with used Kafka as their central source of truth across the organisation. All events generated by users were thrown into a massive Kafka cluster. Each team in the organisation cared about a different view into that data (financials, marketing, fraud, what we display to that user on the website, etc). Each team would ingest the same Kafka topic and do different things with it - often consuming certain events into their own Postgres instance, or other things like that.
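To make the "every team reads the same stream" idea concrete, here is a minimal in-memory sketch. It does not use a Kafka client at all: the `Log` class, the group names and the event fields are all made up for illustration. The only thing it tries to show is that each consumer group tracks its own read position, so one team consuming the log does not take anything away from another.

```python
from collections import defaultdict


class Log:
    """Toy append-only log standing in for a single topic partition (hypothetical)."""

    def __init__(self):
        self.records = []                # events are only ever appended
        self.offsets = defaultdict(int)  # next offset to read, per consumer group

    def append(self, event):
        self.records.append(event)
        return len(self.records) - 1     # offset of the new record

    def poll(self, group):
        """Return every record this group has not seen yet, then advance its offset."""
        start = self.offsets[group]
        batch = self.records[start:]
        self.offsets[group] = len(self.records)
        return batch


log = Log()
log.append({"type": "purchase", "user": "alice", "amount": 30})
log.append({"type": "page_view", "user": "bob"})
log.append({"type": "purchase", "user": "bob", "amount": 12})

# Each team reads the full stream and keeps only the events it cares about.
finance_total = sum(e["amount"] for e in log.poll("finance") if e["type"] == "purchase")
marketing_views = [e["user"] for e in log.poll("marketing") if e["type"] == "page_view"]
```

A group that shows up later (say, a hypothetical "fraud" group) still starts from offset 0 and sees all three events, which is the part a classic work queue does not give you.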

I used Kafka when I made my reddit r/place clone a few years ago because it gives great read and write amplification. With Postgres as a central source of truth, you can only handle thousands of writes per second. And reads will slow down the instance. With Kafka you can handle about 2M/sec. And reads can really easily be serviced from other machines - you can just have a bunch of downstream Kafka instances consuming from the root, and serving your readers in turn.

It may be that you can also solve all these problems with a well-configured RabbitMQ cluster. But coming from a database world I find it more comfortable to reason about architecture, performance and correctness with Kafka.

Size? If you’re getting less than a few hundred events a minute, is it worth setting up Kafka?
This is the main reason I don’t use Kafka much in my own projects. I hope at some point someone makes a Redis equivalent of Kafka for small projects.

Is Rabbit much easier to set up for small projects? I haven’t used it much.

You might be interested in Redis Streams[1], it's basically Kafka in Redis.

[1] https://redis.io/topics/streams-intro

If you're in AWS, you can use Kinesis, which is similar to Kafka. It also ties into a lot of their other offerings, such as:

* S3 - use Kinesis Firehose to take the contents of your Kinesis stream and time-partition it into files, either for ingestion into Redshift, Elasticsearch, etc., or for later batch analysis for ML, or just to treat as cold searchable storage with something like Athena

* DynamoDB - spit out the data into Kinesis from DynamoDB as it changes to create a change stream used elsewhere in your platform (DynamoDB Streams)

* real-time analysis - perform real-time SQL analysis (Kinesis Analytics) on what's in your stream over a given window of time or data, and react as events/situations occur

Looking at all the services that Amazon has built around Kinesis might help you understand some of the differences between Kafka and something like RabbitMQ.

Sounds like your org used Kafka for event sourcing. This is almost always a bad idea; event sourcing and aggregate reconstruction are a nightmare IMO.

Kafka used as a pure FIFO cache for regular CRUD endpoints works fine

Event sourcing was one of LinkedIn's use cases when they created it. Kafka is fine for all logging needs.
Yes; they did. It worked pretty well actually.

Why do you think it’s a bad idea? Most of the arguments against event sourcing that I’ve read seem to be “yes but the tooling isn’t very good”. That might be true, but maybe we solve that problem with more investment into event sourcing; not less of it.

TLDR: the tooling is so bad it's basically impossible to run at scale. I worked for a company that tried. Maybe on a small scale it's fine, but replays and storage of past events take insane amounts of time and space at high event rates, to the point that storage costs and replay times became a real problem (many terabytes and days).

I also don't think it's a great idea in general. The event stream directly replicates a DB commit log, and the aggregates replicate your tables. It's building your own database.
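For anyone unfamiliar with the commit-log/table analogy, a tiny sketch may help. The event shape and the `apply_event` helper are invented for this example and are not taken from any event sourcing framework: the "aggregate" is just whatever state you get by folding over the event history, the same way a table is what you get by applying a commit log.

```python
from functools import reduce


def apply_event(balances, event):
    """Return a new balances mapping with one event applied (hypothetical event shape)."""
    kind, account, amount = event
    delta = amount if kind == "deposit" else -amount
    return {**balances, account: balances.get(account, 0) + delta}


# The event stream plays the role of the commit log.
events = [
    ("deposit", "acct-1", 100),
    ("withdraw", "acct-1", 30),
    ("deposit", "acct-2", 50),
]

# The aggregate plays the role of the table: it is rebuilt by replaying every event.
balances = reduce(apply_event, events, {})
# balances == {"acct-1": 70, "acct-2": 50}
```

Note that the rebuild touches every event ever written, which is where the replay-time and storage complaints come from once the history gets large.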

We had to throw a year's worth of work away at the end so I'm fairly biased against trying it until the ecosystem is better.

Kafka is a high-throughput, horizontally-scalable blob data store for data streams. The data store part of that is my favorite part.

You can use it as a simple message broker, but since it keeps the message history as a timeseries, you can also do things like run batch analysis jobs on the day's messages or replay the last X hours of messages because your DB died and your backup is old.
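A rough sketch of the "DB died and the backup is old" case, using a toy in-memory log rather than a real Kafka client. `RetainedLog`, its methods and the timestamps are all assumptions made up for the example. The idea is just: find the first offset at or after the backup time, then re-apply everything from there.

```python
import bisect


class RetainedLog:
    """Toy stand-in for a topic partition that keeps its history (hypothetical)."""

    def __init__(self):
        self.timestamps = []  # append times, assumed non-decreasing
        self.messages = []

    def append(self, ts, message):
        self.timestamps.append(ts)
        self.messages.append(message)

    def offset_for_time(self, ts):
        """Offset of the first message appended at or after `ts`."""
        return bisect.bisect_left(self.timestamps, ts)

    def read_from(self, offset):
        return self.messages[offset:]


log = RetainedLog()
log.append(100, ("a", 1))
log.append(200, ("b", 2))
log.append(300, ("a", 3))

# The DB died; the only backup was taken at t=150 and misses the later writes.
backup_time = 150
db = {"a": 1}

# Replay everything written since the backup to catch the DB up.
for key, value in log.read_from(log.offset_for_time(backup_time)):
    db[key] = value
```

This only works if the writes are safe to re-apply (here they are plain overwrites) and if the retention window is longer than the age of the backup.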

It is a good way to decouple data producers and data consumers, particularly in an enterprise context - producers push to Kafka and anyone can consume that data, whether they are an operations team that wants a realtime data stream, a BI team that needs periodic data dumps, or a team that wants a long-term audit trail (the duration of the history is going to depend on your scale, but for many users a long history is realistic).

Kafka also has a nice ecosystem, including streaming analytics (KSQL), clients that make reading from Kafka easily horizontally scalable (have many machines acting as a single client, automatically rebalancing if one of those machines dies), exactly-once processing, and probably more since I last worked with it.
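The "many machines acting as a single client" part can be illustrated with a simplified assignment function. This is a plain round-robin written for the example, not Kafka's actual assignor, and the worker names are hypothetical. The point is only that partitions are divided among the live members and redistributed when one disappears.

```python
def assign(partitions, members):
    """Spread partitions round-robin over the live members (simplified, not Kafka's assignor)."""
    if not members:
        raise ValueError("need at least one live member")
    ordered = sorted(members)  # deterministic order so every member computes the same plan
    assignment = {m: [] for m in ordered}
    for i, partition in enumerate(partitions):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


partitions = list(range(6))

# Three workers share six partitions.
before = assign(partitions, {"worker-a", "worker-b", "worker-c"})

# worker-b dies; its partitions are handed to the survivors.
after = assign(partitions, {"worker-a", "worker-c"})
```

In both cases every partition has exactly one owner, so adding workers (up to the partition count) spreads the read load without any worker seeing another's messages.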

I'm not familiar enough with RabbitMQ to say how it compares to Kafka, but I haven't found a use case yet where Kafka isn't a good choice (except for the 'I need to set up a message broker quick and painlessly' use case because it is not a particularly fun technology to manage yourself)

I skimmed over it but it's again mostly about internal design. The high-level use case I see is "publish-subscribe", which is handled by RabbitMQ and a dozen other solutions.

One use case I see is that the events published into Kafka are persisted, so e.g. some component can see a history of some events (so this is something not handled by RabbitMQ). Is that right?

Event streams between decoupled systems are kind of the sweet spot for Kafka. It's extremely easy to scale horizontally, and it handles distributed work and network partitions in an easy-to-reason-about way. I've also seen Rabbit be the bottleneck before, whereas I've never really seen Kafka be the bottleneck in an architecture - it's very analogous to a firehose. For organizations shuttling messages and events between teams, it makes a very convenient lingua franca.
I often do the same thing... skim through articles and papers to get the gist. Trust me, this is one to read all the way through.
That's correct - being able to "rewind" a history of a topic (queue) is a powerful concept. But Kafka is a bit harder to operate than RabbitMQ in my experience. (Somewhat related: one of our subsystems was originally built around Kafka, but was later migrated to RabbitMQ and Postgres.)
A financial exchange: order messages are routed to Kafka and partitioned by the instrument's symbol; match engines associated with a given set of symbols consume from their assigned partitions. When a match engine goes down, it can reconstruct the order book by replaying from a given offset.
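A minimal sketch of that pattern, with in-memory lists standing in for partitions. Everything here is an assumption for illustration: the partition count, the order fields, the `publish` and `rebuild_book` helpers, and the use of CRC32 as the key hash (it is a stand-in, not the hash Kafka's partitioner uses). The "order book" is just the list of orders per side, with no matching logic.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count

partitions = defaultdict(list)  # partition id -> ordered list of order messages


def partition_for(symbol):
    """Stable hash so every order for a symbol lands on the same partition."""
    return zlib.crc32(symbol.encode()) % NUM_PARTITIONS


def publish(order):
    """Append an order to its symbol's partition; return (partition, offset)."""
    p = partition_for(order["symbol"])
    partitions[p].append(order)
    return p, len(partitions[p]) - 1


def rebuild_book(partition, from_offset=0):
    """Reconstruct per-symbol order lists by replaying a partition from an offset."""
    book = defaultdict(lambda: {"buy": [], "sell": []})
    for order in partitions[partition][from_offset:]:
        book[order["symbol"]][order["side"]].append((order["price"], order["qty"]))
    return book


p1, o1 = publish({"symbol": "ACME", "side": "buy", "price": 10.0, "qty": 5})
p2, o2 = publish({"symbol": "ACME", "side": "sell", "price": 10.5, "qty": 2})

# A restarted match engine replays its partition to get its state back.
book = rebuild_book(p1)
```

Because all orders for a symbol share one partition, the replay sees them in the order they were published, which is what makes the rebuilt book deterministic.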
It's basically a high-speed transaction log (persistent, distributed, easily scalable) that happens to store messages and does message brokering very well.