Hacker News
by jonahx 3705 days ago
> What is Kafka?

> Apache Kafka is a distributed commit log for fast, fault-tolerant communication between producers and consumers using message based topics. Kafka provides the messaging backbone for building a new generation of distributed applications capable of handling billions of events and millions of transactions

Can anyone translate this into meaningful English for me?

6 comments

You should read this: https://engineering.linkedin.com/distributed-systems/log-wha... - it's long, but it's one of the most impactful essays I've read on software engineering in years.
I'm not able to find the one that made the light bulb go on in my head, but Martin Kleppmann gives some good conf talks around this topic. This one looks promising - https://www.youtube.com/watch?v=GfJZ7duV_MM
I liked this one [http://www.confluent.io/blog/turning-the-database-inside-out...] a lot, because it highlights the strength of Kafka beyond a simple distributed message queue.
+1 regarding reading that essay on Kafka. The explanations, illustrations, evolution and use case are all well thought out.
You send a message (for example some JSON) to a Kafka topic. Any number of clients subscribe to that topic with a specific start timestamp. Pluck a message off the queue, compute with it, send an acknowledgement. Kafka provides strong assurances that all readers get all the messages and report success (it retries otherwise), even if some participants come and go.

Very useful if, say, you have some real world event and dozens of different micro services need to do something about that event, independently.

You can also just use it for logging.

This is a really inaccurate description. Messages aren't indexed by timestamp in any meaningful way; that feature is currently under development. Messages don't need to be acknowledged; it's the client's responsibility to track which messages have been consumed. The server provides some facilities to make that easier, but ultimately clients can request whatever messages they want (repeating, skipping, whatever), as long as the messages haven't expired out of retention.

If you're actually interested in Kafka, just read the documentation, it's quite good.
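
To make the "clients track what they've consumed" point concrete, here is a minimal sketch. It is a toy in-memory model, not Kafka's client API; `ToyLog`, `append`, and `read` are made-up names for illustration only.

```python
class ToyLog:
    """In-memory stand-in for one topic partition (not Kafka's real API)."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Store a record and return the offset it was assigned."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset; no per-client state is kept."""
        return self._records[offset:offset + max_records]


log = ToyLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.append(event)

# The client, not the server, remembers how far it has read.
position = 0
batch = log.read(position, max_records=2)
position += len(batch)      # advance only after processing the batch

replayed = log.read(0)      # repeating: start over from the beginning
skipped = log.read(2)       # skipping: jump straight to offset 2
```

Nothing is acknowledged or deleted on read here, which is why repeating and skipping are just a matter of which offset the client asks for.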

Interesting. Those are features my previous employer had and used extensively, in particular stateless clients. I guess we added those layers ourselves.
Sounds like a mailing list.
Yes, but for applications: instead of doing a remote call to your other system to create an order or send an email, you just stick the data in a queue, and the other system does it when it feels like it.
We do exactly that for our emails. We're not ultra high volume, but we send millions a day.
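
A rough sketch of that "stick it in a queue and let the other side handle it later" pattern, using only Python's standard library rather than a real broker. The job fields, the `email_worker` function, and the `None` shutdown signal are assumptions made up for the example.

```python
import queue
import threading

outbox = queue.Queue()
sent = []

def email_worker():
    """Drain the queue at its own pace; None is a hypothetical shutdown signal."""
    while True:
        job = outbox.get()
        if job is None:
            break
        sent.append(f"emailed {job['to']} about order {job['order_id']}")

worker = threading.Thread(target=email_worker)
worker.start()

# The "order" side just enqueues and moves on; it never calls the mailer directly.
outbox.put({"to": "a@example.com", "order_id": 1})
outbox.put({"to": "b@example.com", "order_id": 2})

outbox.put(None)    # ask the worker to stop once it has caught up
worker.join()
```

The producer never waits on the mailer, so a slow or briefly unavailable worker only delays the emails instead of failing the request that triggered them.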
Kafka is a distributed logging system. Write lots of data very fast by using sequential I/O. Consuming apps can then read this just as fast and maintain their own state (of where they last read up to) which allows for multiple fast and simple consumers and an easy way to have a lasting "log" of all the data.
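
Roughly what "append sequentially, let each reader keep its own position" looks like with a plain file. This is a sketch of the idea only, not how Kafka lays out its storage; the newline-delimited format and the helper names are assumptions.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "toy.log")

def append(record):
    """Writes only ever go to the end of the file (sequential I/O)."""
    with open(path, "ab") as f:
        f.write(record.encode("utf-8") + b"\n")

def read_from(byte_pos):
    """Read every record after byte_pos; return them with the new position."""
    with open(path, "rb") as f:
        f.seek(byte_pos)
        data = f.read()
    return data.decode("utf-8").splitlines(), byte_pos + len(data)

for r in ["a=1", "b=2"]:
    append(r)

# Each consumer keeps its own position; the log itself stores none.
pos_fast, pos_slow = 0, 0
first, pos_fast = read_from(pos_fast)

append("c=3")
second, pos_fast = read_from(pos_fast)        # only the new record
everything, pos_slow = read_from(pos_slow)    # the slow consumer catches up in one read
```

Because the file is never modified in place, adding another consumer costs nothing on the write side: it is just one more saved position.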
It's a message queue. You use it for everything you want to do outside of the general request cycle, e.g. making API calls, priming caches, sending emails, etc.

Biggest competitors of Kafka are RabbitMQ and Amazon SQS.

It's not a message queue, it's a logging system. Queues are meant for ephemeral messages that expire once consumed. Kafka is immutable log storage that can be read as many times as necessary by consumers.

Biggest competitors would be AWS Kinesis, Azure Event Hubs and Google Pub/Sub.

The biggest difference, IMO, is that Kafka is typically used when a message will be consumed by multiple consumers, whereas RabbitMQ or SQS generally send a message to a single consumer.

We use it to ingest ~40mb/s and fan it out to a number of consuming applications.
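
The difference between "one message, one consumer" and "one message, every consumer" can be sketched in a few lines. This models the two delivery semantics only, not any particular broker; the worker and consumer names are invented.

```python
from collections import deque

messages = ["m1", "m2", "m3", "m4"]

# Work-queue semantics: a message is removed when taken, so each goes to one worker.
work_queue = deque(messages)
queue_seen = {"worker-a": [], "worker-b": []}
workers = list(queue_seen)
turn = 0
while work_queue:
    queue_seen[workers[turn % 2]].append(work_queue.popleft())
    turn += 1

# Log semantics: nothing is removed on read; each consumer has its own offset.
log = list(messages)
offsets = {"billing": 0, "search-indexer": 0}
log_seen = {}
for name in list(offsets):
    log_seen[name] = log[offsets[name]:]
    offsets[name] = len(log)
```

With the queue, the two workers split the four messages between them; with the log, both consumers see all four.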

I'll also add that if you put some thought behind your topic replication and partitioning you can build some incredibly resilient applications. Also, "immutable" isn't strictly true: it's common for Kafka topics to roll off messages based on time or size. (That's just to clarify for those not familiar with Kafka. I realize that you mean messages are not deleted or modified once written to a topic, other than by topic retention settings.)
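
For those not familiar with the idea, "rolling off by time or size" can be sketched like this. It is a toy model: `RetainedLog` and its parameters are hypothetical, and it trims one record at a time by record count and age, which is a simplification of whatever a real broker does.

```python
from collections import deque

class RetainedLog:
    """Toy log that drops its oldest records by age or by count (hypothetical names)."""

    def __init__(self, max_age_s, max_records):
        self.max_age_s = max_age_s
        self.max_records = max_records
        self.records = deque()      # entries are (offset, timestamp, value)
        self.next_offset = 0

    def append(self, value, now):
        self.records.append((self.next_offset, now, value))
        self.next_offset += 1
        self.enforce_retention(now)

    def enforce_retention(self, now):
        # Only the oldest end is ever trimmed; surviving records are never rewritten.
        while self.records and (
            len(self.records) > self.max_records
            or now - self.records[0][1] > self.max_age_s
        ):
            self.records.popleft()

    def earliest_offset(self):
        return self.records[0][0] if self.records else self.next_offset


log = RetainedLog(max_age_s=60, max_records=3)
for t, v in [(0, "a"), (10, "b"), (20, "c"), (30, "d")]:
    log.append(v, now=t)
after_size_limit = [v for _, _, v in log.records]   # "a" rolled off by size

log.enforce_retention(now=85)
after_time_limit = [v for _, _, v in log.records]   # "b" and "c" rolled off by age
```

Offsets keep counting up even as old records disappear, so a consumer that falls too far behind finds that the earliest offset still available is later than the one it wanted.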

It's very easy to set up multiple consumers with RabbitMQ topics though, and SQS is a very basic queueing system that just doesn't support much.

To me, Kafka is just meant for much larger magnitudes of scale and persistence (of the entire log of messages for however long you need) as a core feature.

Google's Pub/Sub is still the best blend of traditional queue semantics with Kafka scale and persistence though.

It's incorrect to say that RMQ sends messages to consumers. What it does is route messages to queues/exchanges. It's then entirely up to you to decide how many consumers will effectively consume them, i.e. you can have as many consumers as you wish.
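
A toy illustration of "route to queues, then let any number of consumers read them". This is not RabbitMQ's client API; `ToyExchange`, `bind`, and `publish` are invented here, and the exchange only does exact-match routing.

```python
from collections import defaultdict, deque

class ToyExchange:
    """Toy direct exchange: copies each message into every queue bound to its key."""

    def __init__(self):
        self.bindings = defaultdict(list)   # routing key -> bound queues

    def bind(self, routing_key, q):
        self.bindings[routing_key].append(q)

    def publish(self, routing_key, message):
        for q in self.bindings.get(routing_key, []):
            q.append(message)


exchange = ToyExchange()
email_queue, audit_queue = deque(), deque()
exchange.bind("order.created", email_queue)
exchange.bind("order.created", audit_queue)

exchange.publish("order.created", "order 42")
exchange.publish("order.created", "order 43")
exchange.publish("order.cancelled", "order 7")   # no binding: dropped in this toy

# Two consumers sharing one queue split its messages between them.
consumer_a_got = email_queue.popleft()
consumer_b_got = email_queue.popleft()
```

The publisher only knows about the exchange. How many queues are bound, and how many consumers share each queue, is decided separately.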
> messages are not deleted or modified once written to a topic, other than by topic retention settings

Except if you have log compaction turned on, I guess.
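
For anyone wondering what compaction means here: the rough idea is that older records for a key can be discarded once a newer record with the same key exists. A minimal sketch of that idea, with a made-up record layout of (offset, key, value):

```python
def compact(records):
    """Keep only the most recent record for each key, preserving offset order."""
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)   # later offsets overwrite earlier ones
    return sorted(latest.values())           # offsets are unique, so this sorts by offset


log = [
    (0, "user-1", "alice@old.example"),
    (1, "user-2", "bob@example.com"),
    (2, "user-1", "alice@new.example"),
]
compacted = compact(log)
```

The surviving records keep their original offsets, so the compacted log has gaps rather than being renumbered.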

A key "selling point" of Kafka for me is that each consumer can decide from when they wish to receive messages. That is, you can replay the messages.

It's a distributed message queue.
It can be used as a queue but the bigger benefit is for streaming use cases. One of the key differences, among others, is that streaming assumes somewhat faster consumers as opposed to queueing. There's also the pub-sub use-case which is generally considered separate from that of a queue (considered a point to point transport).
That is more descriptive, but it still sounds like queue functionality. Streaming processing is just a queue that gets emptied quickly and pub-sub is just a set of queues.
Kafka doesn't generally get emptied quickly, but rather retains messages for a configured time/size. Because of this, consumers can choose to replay previously consumed messages, if they wish to do so.
You're right. I was mostly commenting on the common idiomatic ways ppl differentiate streams vs queues. Indeed, it can be used in both scenarios.