| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by datadeft 1135 days ago
	I am not entirely sure what is the reason to make Kafka transactional. The original goal was to have a message queue that holds statistical data where the data loss cannot significantly alter the outcome of the analytics performed on the (often incomplete) data. Why are in this argument about fsync and such now? Did something change? If you need reliable data storage do not use Kafka or similar technologies.

3 comments

ako 1135 days ago

Do you have a source/link for that original goal? I wasn’t aware of this, and as such expect that I can rely on kafka for my events. Also, if this is really the case it should be mentioned on the homepage of Kafka.

Just checked kafka’s homepage, it mentions mission critical, durable, fault tolerant, stores data safely, zero message loss, trusted… Seems they’ve moved on from their original goal.

link

datadeft 1135 days ago

"The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type."

https://kafka.apache.org/

link

skrtskrt 1135 days ago

Kafka is used widely as a persistent event store, and its development features reflect that.

Why would I not just turn on fsync or deploy in a distributed pattern for reliability so I can just continue using it instead of ripping it out, benchmarking something new, teaching the entire org something new, potentially negotiating a new contract, and then executing a huge migration?

link

datadeft 1135 days ago

Just like heroin is widely used a recreational drug. We live in a free world and you can use Kafka as a persistent reliable store, even use it transactionally.

Instead of reading the marketing claims I like to read what @aphyr has to say about data storage systems.

https://aphyr.com/posts/293-call-me-maybe-kafka

link

relay23 1135 days ago

Also https://jepsen.io/analyses/redpanda-21.10.1

link

relay23 1135 days ago

Are you sure performance would be acceptable if you just turned on fsync on every message?

link

skrtskrt 1135 days ago

well it obviously depends on your usage patterns.

But at a certain point any technology is going to reach the limits of what current hardware and operating system primitives can do.

fsync vs. distributed consensus vs. other tradeoffs w.r.t reliability and consistency are not inherent to "Kafka or similar technologies". It's inherent to anything that runs on a computer in the real world.

Generally unless your scale is mind-bogglingly big, the ROI on tuning what you already have is going to be way way bigger than just ripping it out because you read a benchmarking article.

link

tyingq 1135 days ago

>The original goal was to have a message queue that holds statistical data

I suppose that might have been the original goal, but the current tag line includes "data integration" and "mission-critical".

"Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications."

link

datadeft 1135 days ago

I guess you can add any feature to anything. I think this whole investor driven development is just sad.

link