by trengrj 2163 days ago
With Pulsar vs Kafka, I don't see a huge argument between either one functionality wise as they have so much in common (distributed log, Java based, avoid copying memory, use Zookeeper). Because Kafka is more supported and well-known it seems Pulsar needs to be an order of magnitude more performant to capture developer mindshare.

I see the same with Spark vs Flink in that similarities outweigh differences. I wonder if this is some sort of emergent pattern in open source software.

3 comments

There are real differences among them. Here are some painful aspects of Kafka:

1. A single partition is stored on one node (with replicas on other nodes). Because of this, introducing new nodes takes a very long time when partitions are large, since a partition can be replicated from only one node (the partition's leader). In Pulsar, each segment of a partition is stored on a different BookKeeper node.

2. Because of 1, if two consumers read parts of a partition that are far from each other, they will compete for disk bandwidth. In Kafka, a consumer cannot read from a replica node. If a topic is really popular and many consumers try to read from it (from different parts of the file, which makes the OS page cache useless), the total consumption rate is limited to the disk bandwidth of a single node. But in Pulsar each consumer can read from different brokers. Catch-up consumers won't thrash streaming consumers in Pulsar.
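To make points 1 and 2 concrete, here is a toy sketch of the two placement styles. It is only an illustration: the node names are made up, replicas are ignored, and the round-robin striping is a simplification of how BookKeeper actually chooses storage nodes for each segment.

```python
from collections import Counter


def single_node_placement(num_segments, leader):
    """Kafka-style: every segment of the partition lives on the leader."""
    return [leader] * num_segments


def striped_placement(num_segments, nodes):
    """Segment-style: spread segments round-robin over the storage nodes
    (a simplification, not BookKeeper's real placement policy)."""
    return [nodes[i % len(nodes)] for i in range(num_segments)]


# Hypothetical node names, used only for this illustration.
nodes = ["node-a", "node-b", "node-c"]

single = Counter(single_node_placement(6, "node-a"))
striped = Counter(striped_placement(6, nodes))

# Segments held per node: this is the set of disks that historical reads
# (and re-replication) can draw on.
print("single node:", dict(single))
print("striped:    ", dict(striped))
```

With one node holding everything, all catch-up reads hit the same disk; with striping, the same reads are spread over three disks.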

These are not problems that can be fixed easily. Additionally, in the realm of streaming, the difference between Flink and Spark is night and day. The low watermark feature that Flink offers makes them behave fundamentally differently.

1. is true, but if you want that data to move to a new node, it still needs to be replicated. Kafka's approach is to use tiered storage (which I believe is close to completion).

2. Kafka can read from a replica node. It's relatively new but it's there.
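For reference, reading from a replica ("follower fetching", added by KIP-392 in Kafka 2.4) is switched on through configuration along these lines. The settings are shown as Python dicts purely for illustration; the rack name, bootstrap address, and group id are placeholders, not values from this thread.

```python
# Broker side (these keys go in server.properties).
broker_config = {
    "broker.rack": "rack-1",  # placeholder rack id
    "replica.selector.class":
        "org.apache.kafka.common.replica.RackAwareReplicaSelector",
}

# Consumer side: a consumer that declares its rack may be served by an
# in-sync replica in the same rack instead of the partition leader.
consumer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "group.id": "catch-up-readers",         # hypothetical group name
    "client.rack": "rack-1",                # should match a broker.rack
}


def to_properties(config):
    """Render a config dict in Java .properties style."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


print(to_properties(broker_config))
print(to_properties(consumer_config))
```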

That's true, but the limitation is still not fully resolved. To increase the consumption rate, we need to add replicas. In Pulsar, brokers are merely cache nodes over BookKeeper. Adding more brokers is trivial in Pulsar.
How does Pulsar get around the fact that, when a new broker is added, data needs to be moved over before that broker can start serving it? This seems like a basic law-of-physics type limitation to me.
Hey, I work on Pulsar, will try and answer this :)

Topics (actually groups of topics, called bundles) are what get assigned to brokers. Topic assignment is dynamic, so when a new broker is added, the system will try to shed load from the busiest brokers to even it out across the system.

But unlike Kafka, when a topic is assigned to a broker, it doesn't have much state to move. Mostly it just gets metadata added to it and opens a new "ledger" (which is just a chunk of the topic's data over a time window; only one ledger is ever open at once). When it needs to serve data, it pulls that from the BookKeeper nodes holding previous ledgers, so the process of redistributing load is pretty quick. It also doesn't eagerly pull data into a cache.
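A rough model of that reassignment, as a sketch only: the class and function names below are invented for illustration and are not Pulsar's API. The point is that moving a topic changes ownership metadata and opens a new ledger, while the existing ledgers stay where they are.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    ledger_id: int
    bookies: tuple        # storage nodes holding this ledger's entries
    closed: bool = False


@dataclass
class Topic:
    name: str
    owner: str            # broker currently serving the topic
    ledgers: list = field(default_factory=list)


def reassign(topic, new_broker, bookies_for_new_ledger):
    """Hand the topic to another broker.

    Only metadata changes: the open ledger is closed, a fresh one is
    opened, and the owner is updated. No ledger data is copied.
    """
    if topic.ledgers:
        topic.ledgers[-1].closed = True
    next_id = topic.ledgers[-1].ledger_id + 1 if topic.ledgers else 0
    topic.ledgers.append(Ledger(next_id, bookies_for_new_ledger))
    topic.owner = new_broker
    return topic


# Hypothetical topic, broker, and bookie names.
topic = Topic("orders", "broker-1", [Ledger(0, ("bookie-a", "bookie-b"))])
reassign(topic, "broker-2", ("bookie-b", "bookie-c"))

for ledger in topic.ledgers:
    print(ledger.ledger_id, ledger.bookies, "closed" if ledger.closed else "open")
```

After the call, ledger 0 is still on the same bookies, just closed, and exactly one ledger is open under the new owner.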

Now, as far as the cache goes, that is primarily for "tailing reads", meaning that as writes occur, clients who are close to the tip of the recent data will just get it from the broker, without needing to pull it from BookKeeper. This is one of the key parts of how Pulsar has multiple tiers of storage that help it have such good, consistent latency.
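The tailing-read path can be sketched roughly like this. It is a toy model with invented names: a plain dict stands in for BookKeeper, and the fixed-size, evict-oldest cache is an assumption made for brevity, not a description of the real broker cache.

```python
from collections import OrderedDict


class ToyBroker:
    def __init__(self, bookkeeper, cache_size=3):
        self.bookkeeper = bookkeeper    # dict: entry_id -> payload (stand-in store)
        self.cache = OrderedDict()      # most recently written entries only
        self.cache_size = cache_size
        self.bookkeeper_reads = 0

    def write(self, entry_id, payload):
        """Persist the entry and keep it in the tail cache."""
        self.bookkeeper[entry_id] = payload
        self.cache[entry_id] = payload
        while len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict the oldest cached entry

    def read(self, entry_id):
        """Serve tailing reads from the cache, older reads from storage."""
        if entry_id in self.cache:
            return self.cache[entry_id]
        self.bookkeeper_reads += 1
        return self.bookkeeper[entry_id]     # catch-up read, not cached


broker = ToyBroker(bookkeeper={})
for i in range(10):
    broker.write(i, f"m{i}")

tail = broker.read(9)   # near the tip: served from the broker cache
old = broker.read(0)    # far behind: fetched from the stand-in store
print(tail, old, broker.bookkeeper_reads)
```

Because catch-up reads here bypass the cache, a slow consumer does not evict the entries that tailing consumers depend on.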

Beyond processing writes, the biggest thing brokers do is handle "tailing reads", i.e., clients consuming right near the tip of the topic; this is the cache referred to. That means that when a new broker is three purposes:

1. Handling writes

(copying this text from another comment of mine elsewhere)

Well, the Pulsar broker is (kinda) stateless, because it is essentially a caching layer in front of BookKeeper. But where's your data actually stored then? In BookKeeper bookies, which are stateful. Killing and replacing/restarting a BookKeeper node requires the same redistribution of data as in Kafka’s case. (Additionally, BookKeeper needs a separate data recovery daemon to be run and operated, https://bookkeeper.apache.org/archives/docs/r4.4.0/bookieRec...)

So the comparison of 'Pulsar broker' vs. 'Kafka broker' is very misleading because, despite identical names, the respective brokers provide very different functionality. It's an apples-to-oranges comparison, like if you'd compare memcached (Pulsar broker) vs. Postgres (Kafka broker).

Network is faster than disk. Once the data is cached, you are only bound by network IO for subsequent reads.
Sure, but how is this different from Kafka's caching?
Pulsar is better for very large scale deployments, provided you have people to manage it.
Kafka is handling very large scale deployments just fine atm in all the big tech co's.

The only thing I can see that can make this true is Pulsar seems to have better elastic scalability. But it seems to score less on everything else. It has a much more complex storage system that ends up not matching Kafka's high-end throughput at large scale.

From what I recall, Twitter ended up abandoning BookKeeper due to storage scale concerns. Related: https://blog.twitter.com/engineering/en_us/topics/insights/2...

This is mostly due to the difficulties scaling DistributedLog more so than BookKeeper. DistributedLog basically had no contributors other than Twitter and was just too big of a mountain to climb alone. The blog post you linked goes somewhat into this but that is ultimately why the choice to transition away was made.

Pulsar likely would have been considered if it was more mature at the time and sported a community of comparable size to Kafka (it's still a long way from this).

Show me one
>it seems Pulsar needs to be an order of magnitude more performant to capture developer mindshare.

Just to add to this, ease of use/setup is also a huge factor. There are technologies I can just spin up with zero knowledge and learn as I go. These are huge factors in adoption especially with Golang and nodejs.