| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by boredandroid 4038 days ago

One thing to realize is that a partitioned log is a generalization of an unpartitioned log (i.e. if you set # partitions = 1 in a partitioned log you have an unpartitioned log).

In Kafka the purpose of partitions is to provide computational parallelism not model entities in the world. So if you have 100m users you would map that into a number of partitions based on your computational parallelism (maybe 10-100 machines/processes/threads). In other words you would have a single topic partitioned by user id, not a topic per user.

If you have a centralized relational database that maps reasonably well to a single partition log (both in terms of scalability and guarantees).

For distributed databases you generally don't have a total order over all operations. What you usually have is (at best) a per partition ordering, which maps well to a partitioned log as well.

For applications that record events (logging or whatever) it is natural to think of each application thread or process as a kind of actor with a total order.

1 comments

tristanz 4038 days ago

Yeah, this is a good way to look at it. But it's also my point. Partitions are about parallelism and don't always fit the data model or domain. While you can reduce the partitions to 1, this is limits parallelism. It's not always easy to design a partitioning scheme that preserves the replication semantics you want. And these semantics vary from having something totally ordered, to having something that you can replay on a very fine-grained level (to replicate to the client for instance.) Most discussions of the advantages of logs really emphasize how amazing a totally ordered log is for replication, but that's not actually what production deployments look like so you still need to think carefully about what happens when writes are being applied to your datastores without a clear order.

lobster_johnson 4037 days ago

What Kafka gets right is that it explicitly addresses the fact that if you add parallelism, order goes out of the window (modulo partition key).

This is something that tends to surprise developers early on (myself included, years ago). But plenty of people still use queue solutions like RabbitMQ without thinking it all the way through.

Unfortunately, partitioning introduces a design step that makes it a little harder to make processing generic. With RabbitMQ you just post to an exchange and let queues (ie., consumers) filter on the routing keys; if no queues have been bound, for example, messages don't go anywhere. If you want, or don't want, parallelism, you just run either multiple consumers or just one. With Kafka, you need to decide beforehand, and design the "topology" of your log carefully, not just for the producer, but for each consumer. When producers and consumers are different apps, this starts smelling like a violation of the principle of "separation of concerns".

I rather wish Kafka had a better routing mechanism, actually. I don't see any reason why it couldn't have routing keys, just like RabbitMQ.