by mattboyle 2361 days ago
It was about 6 months ago.

I completely disagree with the opex of picking up kafka vs developing a whole client library. Please could you try and explain how you came to this conclusion?

2 comments

> Please could you try and explain how you came to this conclusion?

1. Stateless brokers

With Kafka, any time a broker goes down you need to know its Kafka broker id so the replacement can take it over. Yes, this can be fixed by creating your entire infrastructure as code and keeping track of state.

This is a big chunk of OpEx. I've seen few people successfully automate this; Netflix is one of the few. The rest just get by with a manual process plus tooling: a pager, Kafka tooling to spawn a replacement node with the looked-up broker id, etc.

2. Kafka MirrorMaker

Granted I have not used v2 that recently came out in ~2.6 but dear gosh v1 was so bad that Uber wrote their own replacement from the ground up called uReplicator. The amount of time wasted on replication broken across regions is disgusting.

3. Optimization & Scaling

Kafka bundles compute & storage. There's no way that I know of to split them (maybe in an upcoming KIP). This means you'll waste time on the Ops side deciding on tradeoffs between your broker throughput and your broker space.

Worse yet time & money will be wasted here. I'd just rather hire more people than waste time on silly things like this. This is where I justify taking on the expense of client libs.
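To make the compute/storage coupling concrete, here is a rough sizing sketch. All the numbers and the `brokers_needed` helper are made up for illustration, not taken from any real cluster; the point is only that when each broker carries both throughput and disk, the larger of the two requirements sets the broker count.

```python
import math

def brokers_needed(throughput_mb_s, storage_tb, per_broker_mb_s, per_broker_tb):
    """Return (total, for_throughput, for_storage) broker counts.

    Hypothetical helper: assumes each broker contributes a fixed amount of
    throughput and disk, so the cluster is sized by whichever need is larger.
    """
    for_throughput = math.ceil(throughput_mb_s / per_broker_mb_s)
    for_storage = math.ceil(storage_tb / per_broker_tb)
    return max(for_throughput, for_storage), for_throughput, for_storage

# Illustrative workload: modest throughput, lots of retained data.
total, by_throughput, by_storage = brokers_needed(
    throughput_mb_s=200, storage_tb=60, per_broker_mb_s=100, per_broker_tb=4
)
print(f"throughput alone: {by_throughput} brokers, storage alone: {by_storage} brokers")
print(f"coupled cluster size: {total} brokers")
```

In this made-up case storage dictates the cluster size, and most of the compute you pay for sits idle.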

4. Segments vs Partitions

The major time wasters are the situations where the cluster gets utterly destroyed. It will happen; it isn't a question of if but of when, unless the company goes belly up first and nobody cares.

It's 3 AM, the producer is getting back pressure, you get a page and now have to add write capacity to avoid a hot spot. Don't forget you can't simply do a rebalance in Kafka or you'll break the contract with every developer who has developed under the golden rule of, "Your partition order will always be the same".

You'll pay the cost of upgrading the entire cluster and then spend 3 days coming up with a way to rebalance without making all your devs riot against you when you break that golden contract.
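A minimal sketch of why adding partitions breaks that contract, assuming keyed messages are routed by hash-mod-partition-count. Kafka's default partitioner uses murmur2 for keyed records; `zlib.crc32` below is only a stand-in, and the function names and key format are hypothetical.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for a hash-mod partitioner (crc32 instead of Kafka's murmur2).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def moved_keys(keys, old_count, new_count):
    # Keys whose partition changes when the partition count changes.
    return [k for k in keys
            if partition_for(k, old_count) != partition_for(k, new_count)]

keys = [f"user-{i}" for i in range(1000)]
moved = moved_keys(keys, 8, 12)
print(f"{len(moved)} of {len(keys)} keys land on a different partition "
      f"after going from 8 to 12 partitions")
```

Any key that moves now has its older messages on one partition and its newer ones on another, so per-key ordering across the change is no longer guaranteed.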

RIP Kafka

Having spent a couple of years dealing with Kafka, I'm sorry to burst people's bubbles but it is dead. Even Confluent doesn't have a good enough story these days to keep you from switching to Pulsar; they're going to sell you on the same consulting bs: "We're more mature", "We've got better tooling", "Better support"...

Yes, of course, it has been in the open source community 5 years longer and the company has been around longer as well. Kafka is dead, long live Pulsar.

I think what is dead is confluent cloud b/c Amazon MSK and Azure HDInsight will be close to feature parity at much less cost.

Damn, I got lazy on my reply & just hoped nobody went further, but well played on digging deeper.

5. Kafka is silly expensive

Pulsar supports message ack with subscription groups. The worst case with Pulsar is you're storing the entire retention period.

Let's say you have a 4 day retention window, to cover an outage happening on Friday and not having to deal with it until Monday. This is pretty typical with what I see in the Kafka world for small-mid size companies who don't want to pay the 1.5x OT on call.

So, with Pulsar you're at worst storing the 4 days of data but at best you're only storing the messages within the lag period of all consumer groups acknowledging the message.

Now, without getting too deep into Pulsar's feature set, even that is a lie because Pulsar has tiered storage as a first class citizen. The messages older than four days could be shipped off to S3 if we wanted, or even after 1 day depending on our use case, and this is all built into Pulsar, no OpEx tooling required. Even accessing the messages from S3 through Pulsar is abstracted; there's no tooling required to pull them back in if you wanted.

Now with Kafka both our best and worst case is simply 4 days of retention data. This can get very expensive as compute & storage are tied together; it means scaling up all the brokers (even though we don't need the throughput) for the storage increase. Now, yes, MSK basically abstracts all this from you but you're paying for it.
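A back-of-the-envelope version of the storage argument above. The ingest rate and consumer lag are invented numbers, `retained_gb` is a hypothetical helper, and replication and tiered storage are ignored; it only compares "keep the full retention window" against "keep only what the slowest subscription has not yet acknowledged".

```python
SECONDS_PER_DAY = 86_400

def retained_gb(ingest_mb_per_s: float, seconds: float) -> float:
    # Data held on disk if everything produced over `seconds` is kept.
    return ingest_mb_per_s * seconds / 1024

ingest_mb_per_s = 50.0   # hypothetical steady ingest rate
retention_days = 4       # the 4 day window from the comment
max_lag_seconds = 600    # hypothetical lag of the slowest subscription

# Full retention window: the fixed case for Kafka, the worst case for Pulsar.
worst_case_gb = retained_gb(ingest_mb_per_s, retention_days * SECONDS_PER_DAY)
# Only the unacknowledged backlog: the claimed best case for Pulsar.
best_case_gb = retained_gb(ingest_mb_per_s, max_lag_seconds)

print(f"full retention window: {worst_case_gb:,.0f} GB")
print(f"unacked backlog only:  {best_case_gb:,.1f} GB")
```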

6. AWS managed services are not equal citizens to standalone EC2

Managed services right now don't fall under the new Savings Plans: https://aws.amazon.com/blogs/aws/new-savings-plans-for-aws-c...

This will cost you a 30-60% discount on your entire Kafka bill.

7. Excel Life

If I look at the numbers for what I'm doing, it would have cost ~$4M for Kafka vs ~$1M for Pulsar.

While bare metal Kafka really does come bundled with lots of OpEx trouble, have you ever tried using an orchestrator to manage it?

A DC/OS deployment easily takes care of 1. and 2.

3. and 4. are valid points, but I think in real life these scenarios are usually related to cloud service cost optimization, and I would never recommend anyone running Kafka in a cloud for these reasons.

There is one more reason, which was not cited but is a real killer for the cloud Kafka dream AFM: clouds, being prone to all kinds of network interruptions, are not well suited for running Zookeeper ensembles with decent uptime.

Disclaimer: I have never tried or used Apache Pulsar, and am just examining its documentation after spotting this thread.

> "using an orchestrator to manage it"

These can be just as fragile and now you have to learn how to manage the orchestrator. Even Confluent's own Kubernetes operator has issues. There are just too many issues with Kafka's design that hinder easy operations.

> "I would never recommend anyone running Kafka in a cloud"

That's a major problem considering that's where most computing is heading. At this point, running in noisy, overloaded cloud environments is a good test of the reliability and durability of a software system. Kafka fails massively here.

I recently did a talk covering a lot of what I wrote: https://www.youtube.com/watch?v=jLruEmh3ve0