by datadeft 1135 days ago
I am not entirely sure what the reason is for making Kafka transactional. The original goal was to have a message queue that holds statistical data, where data loss cannot significantly alter the outcome of the analytics performed on the (often incomplete) data. Why are we in this argument about fsync and such now? Did something change?

If you need reliable data storage do not use Kafka or similar technologies.

3 comments

Do you have a source/link for that original goal? I wasn’t aware of this, and as such expect that I can rely on Kafka for my events. Also, if this is really the case, it should be mentioned on the homepage of Kafka.

Just checked Kafka’s homepage: it mentions mission critical, durable, fault tolerant, stores data safely, zero message loss, trusted… Seems they’ve moved on from their original goal.

"The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type."

https://kafka.apache.org/

Kafka is used widely as a persistent event store, and its development features reflect that.

Why would I not just turn on fsync or deploy in a distributed pattern for reliability so I can just continue using it instead of ripping it out, benchmarking something new, teaching the entire org something new, potentially negotiating a new contract, and then executing a huge migration?
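For readers wondering what "turn on fsync or deploy in a distributed pattern" looks like in practice, here is a rough sketch of the Kafka settings usually involved. The setting names are real Kafka topic and producer configs; the values (flush after every message, two in-sync replicas) are illustrative assumptions, not recommendations, and `to_properties` is a hypothetical helper that only formats the dicts.

```python
# Topic-level settings (values are illustrative assumptions).
topic_config = {
    "flush.messages": 1,                        # fsync the log after every message
    "min.insync.replicas": 2,                   # a write needs 2 in-sync replicas
    "unclean.leader.election.enable": "false",  # never elect an out-of-sync replica
}

# Producer-side settings.
producer_config = {
    "acks": "all",                 # wait for all in-sync replicas to acknowledge
    "enable.idempotence": "true",  # avoid duplicates on producer retries
}


def to_properties(config):
    """Hypothetical helper: render a dict as key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


print(to_properties(topic_config))
print(to_properties(producer_config))
```

Whether you lean on fsync, on replication, or on both is exactly the tradeoff argued about in this thread; the sketch only shows which knobs exist.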

Just like heroin is widely used as a recreational drug. We live in a free world and you can use Kafka as a persistent reliable store, even use it transactionally.

Instead of reading the marketing claims I like to read what @aphyr has to say about data storage systems.

https://aphyr.com/posts/293-call-me-maybe-kafka

Are you sure performance would be acceptable if you just turned on fsync on every message?
Well, it obviously depends on your usage patterns.
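To make "depends on your usage patterns" a bit more concrete, here is a small sketch that writes the same messages to a plain file twice: once with an fsync after every message and once with an fsync every 100 messages. This is not Kafka, just `os.fsync` on a local file; the message size, message count, and batch size are arbitrary assumptions, and the timings will vary a lot by disk and filesystem.

```python
import os
import tempfile
import time


def write_messages(path, messages, fsync_every):
    """Append messages to path, calling fsync after every `fsync_every` messages.

    Returns (elapsed_seconds, number_of_fsync_calls).
    """
    fsync_calls = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        for i, msg in enumerate(messages, start=1):
            f.write(msg)
            if i % fsync_every == 0:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to push it to the device
                fsync_calls += 1
        # One final sync so any trailing messages are also on disk.
        f.flush()
        os.fsync(f.fileno())
        fsync_calls += 1
    return time.perf_counter() - start, fsync_calls


messages = [b"x" * 100 + b"\n" for _ in range(1000)]
with tempfile.TemporaryDirectory() as d:
    per_msg_time, per_msg_syncs = write_messages(os.path.join(d, "a.log"), messages, 1)
    batched_time, batched_syncs = write_messages(os.path.join(d, "b.log"), messages, 100)

print(f"fsync per message: {per_msg_syncs} fsyncs, {per_msg_time:.4f}s")
print(f"fsync per 100:     {batched_syncs} fsyncs, {batched_time:.4f}s")
```

The per-message run issues roughly 100 times as many fsync calls for the same data, so if your workload can tolerate batching, the cost of durability drops accordingly.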

But at a certain point any technology is going to reach the limits of what current hardware and operating system primitives can do.

fsync vs. distributed consensus vs. other tradeoffs w.r.t. reliability and consistency are not inherent to "Kafka or similar technologies". They're inherent to anything that runs on a computer in the real world.

Generally unless your scale is mind-bogglingly big, the ROI on tuning what you already have is going to be way way bigger than just ripping it out because you read a benchmarking article.

>The original goal was to have a message queue that holds statistical data

I suppose that might have been the original goal, but the current tag line includes "data integration" and "mission-critical".

"Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications."

I guess you can add any feature to anything. I think this whole investor-driven development is just sad.