by trengrj
2163 days ago
With Pulsar vs Kafka, I don't see a huge argument for either one functionality-wise, as they have so much in common (distributed log, Java-based, avoid copying memory, use ZooKeeper). Because Kafka is better supported and better known, it seems Pulsar needs to be an order of magnitude more performant to capture developer mindshare. I see the same with Spark vs Flink, in that the similarities outweigh the differences. I wonder if this is some sort of emergent pattern in open source software.
|
1. In Kafka, a single partition is stored on one node (with replicas on other nodes). Because of this, bringing new nodes up to date takes a very long time when partitions are large, since a partition can only be replicated from one node (the leader of that partition). In Pulsar, each segment of a partition is stored on a different BookKeeper node.
2. Because of 1, if two consumers read parts of a partition that are far apart, they compete for disk bandwidth. In Kafka, a consumer cannot read from a replica node. If a topic is really popular and many consumers read from it (from different parts of the file, which makes the OS page cache useless), the total consumption rate is limited to the disk bandwidth of a single node. In Pulsar, each consumer can read from a different broker, so catch-up consumers won't thrash streaming consumers.
These are not problems that can be fixed easily. Additionally, in the realm of streaming, the difference between Flink and Spark is night and day. The low watermark feature that Flink offers makes the two behave fundamentally differently.
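
To make the two numbered points concrete, here is a rough toy model in Python. It is not Kafka or Pulsar code: the function names, the round-robin segment placement, the node names, and the 100 MB/s disk figure are all made up for illustration, and replicas are ignored. It only shows how the aggregate read rate changes when consumers hit different segments of the same partition.

```python
# Toy model only. Nothing here talks to Kafka or Pulsar; every name and
# number is a hypothetical assumption chosen for illustration.

def single_node_placement(num_segments, nodes):
    """Kafka-like: the whole partition sits on one node (the leader)."""
    leader = nodes[0]
    return {seg: leader for seg in range(num_segments)}


def spread_placement(num_segments, nodes):
    """Pulsar-like: segments are spread over storage nodes.

    Round-robin is an assumption made here to keep the model simple.
    """
    return {seg: nodes[seg % len(nodes)] for seg in range(num_segments)}


def aggregate_read_rate(placement, segments_being_read, disk_mb_per_s):
    """Total read rate if every busy node is capped at its disk bandwidth.

    Assumes the reads miss the page cache, so consumers that land on the
    same node have to share that node's disk.
    """
    busy_nodes = {placement[seg] for seg in segments_being_read}
    return len(busy_nodes) * disk_mb_per_s


nodes = ["n1", "n2", "n3", "n4"]
reads = [0, 3, 5, 6]  # four consumers, each in a different segment

kafka_like = aggregate_read_rate(single_node_placement(8, nodes), reads, 100)
pulsar_like = aggregate_read_rate(spread_placement(8, nodes), reads, 100)

print(kafka_like)   # 100: all four consumers share one disk
print(pulsar_like)  # 400: each consumer lands on a different node
```

The same placement difference is what drives point 1: with a single-node layout a new node can only copy from one source, while with spread segments it could in principle pull from several nodes at once.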