by AndrewKemendo 2024 days ago
Alternatively, from Jay Krebs [1], a much more thorough and nuanced discussion that is probably the best write-up on this topic.

"So is it crazy to do this? The answer is no, there’s nothing crazy about storing data in Kafka: it works well for this because it was designed to do it. Data in Kafka is persisted to disk, checksummed, and replicated for fault tolerance. Accumulating more stored data doesn’t make it slower. There are Kafka clusters running in production with over a petabyte of stored data."

[1] https://www.confluent.io/blog/okay-store-data-apache-kafka/

8 comments

That post explains that there are scenarios where it makes sense to store data permanently in Kafka. "Kafka is Not a Database" makes a different point, which is that Kafka doesn't solve any of the hard transaction-processing problems that database systems do, so it's not an alternative to a traditional DBMS. This is not a straw man---Confluent advocates for "turning the database inside out" all over their marketing materials and conference talks.
Indeed. The HN community has been quite negative on Confluent marketing Kafka as a database.

https://news.ycombinator.com/item?id=23206566

Apache Kafka also doesn’t refer to itself as a database in its own documentation:

https://kafka.apache.org/documentation/#introduction

It actually does, with Exactly-once Semantics. In fact, I've been using it as the single source of truth in a cash management system for almost 2 years without a single issue related to transactions.
How do you deal with side effects outside of the Kafka cluster?
Short answer: you write your consumer's state into the same DB as you're writing the side-effects to, in the same transaction.

Long answer: say your consumer is a service with a SQL DB -- if you want to process Event(offset=123), you need to 1. start a transaction, 2. write a record in your DB logging that you've consumed offset=123, 3. write your data for the side-effect, 4. commit your transaction. (Reverse 2 and 3 if you prefer; it shouldn't make a difference). If your side-effect write fails (say your DB goes down) then your transaction will be broken, your side-effect won't be readable outside the transaction, and the update to the consumer offset pointer also won't get persisted. Next loop around on your consumer's event loop, you'll start at the same offset and retry the same transaction.
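
A minimal sketch of the four steps above, using Python's built-in sqlite3 as a stand-in for the service's SQL DB. The table names, the `process_event` helper, and the `side_effect` callback are all hypothetical, made up for illustration; a real consumer would fetch events from Kafka and seek to the stored offset on startup.

```python
import sqlite3

def make_db():
    # In-memory DB for illustration; table names are assumptions, not from the thread.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE consumer_offsets (consumer TEXT PRIMARY KEY, next_offset INTEGER NOT NULL)")
    conn.execute("CREATE TABLE ledger (event_offset INTEGER PRIMARY KEY, amount INTEGER NOT NULL)")
    return conn

def next_offset(conn, consumer):
    # The offset the consumer should process next (0 if nothing was consumed yet).
    row = conn.execute(
        "SELECT next_offset FROM consumer_offsets WHERE consumer = ?", (consumer,)
    ).fetchone()
    return row[0] if row else 0

def process_event(conn, consumer, offset, side_effect):
    # Skip redeliveries of events that were already committed.
    if offset < next_offset(conn, consumer):
        return False
    # `with conn` commits if the body succeeds and rolls back if it raises,
    # so the offset update and the side effect land together or not at all.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO consumer_offsets (consumer, next_offset) VALUES (?, ?)",
            (consumer, offset + 1),
        )
        side_effect(conn, offset)
    return True

def credit(conn, offset):
    # Example side effect: record a payment tied to the event's offset.
    conn.execute("INSERT INTO ledger (event_offset, amount) VALUES (?, ?)", (offset, 100))
```

If `side_effect` raises, the stored offset stays where it was, so the next pass of the event loop retries the same offset.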

Persisting message offsets in a DB has its own challenges. The apps become tightly coupled with a specific Kafka cluster, and that makes it difficult to swap clusters in case of a failover event.

If you expect apps to persist offsets, then it's important to have a mechanism/process to safely reset the app state in the DB when the stored offset doesn't make sense.

Interesting, do you have any resources about how to best handle side effects with message queues (e.g. GCP PubSub)? Trying to find out if it's worth the effort, or good practice, to allow replayability of messages (like from a backup) and get back to the same state.
Thanks, this is a great article. The money quote for me is:

> I think it makes more sense to think of your datacenter as a giant database, in that database Kafka is the commit log, and these various storage systems are kinds of derived indexes or views.

My company supports analytic systems and we see this pattern constantly. It's also sort of a Pat Helland view of the world that subsumes a large fraction of data management under a relatively simple idea. [1] What's interesting is that Pat also sees it as a way to avoid sticky coordination problems as well.

[1] http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf

The problem is that the Kafka API and a commit log API are very different.

If you wanted to literally use Kafka for your commit log, the same way Amazon Aurora uses a distributed commit log, you would find that a lot of the features a commit log needs are missing and impossible to add to Kafka.

Greenspun's 11th law: Any sufficiently stateful program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of SQL / relational databases.
Is that another definition for confirmation bias?

I can name so many use cases a relational database is incapable of...

Yes. As the article points out, there's no way to reject an event based on some criteria that keeps the events internally consistent. For instance, two people can't both check out the same item when inventory = 1. You need to first validate the event against the materialized view of the commit log to make sure the item is still available. However, what happens when your validation goes out of date? For example, what if there's an event checking out the last item of inventory in the event log, but the materialized view doesn't reflect that yet? You end up checking out the same inventory twice and promising it to both customers! Not good.

DBs can solve this through optimistic locking (atomically making sure my new event applies to the version number of the previous event + 1, otherwise failing), but there's no way to do this in Kafka as far as I know (however, as this is a problem I'm currently facing, please feel free to let me know if there is).
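
For illustration, here is a rough sketch of the optimistic-locking check described above, as a tiny in-memory event stream. `EventStream`, `VersionConflict`, and the `expected_version` parameter are hypothetical names invented for this example; this is not a Kafka API and says nothing about what Kafka can or cannot do.

```python
import threading

class VersionConflict(Exception):
    """Raised when the writer validated against a stale version of the stream."""

class EventStream:
    def __init__(self):
        self._events = []
        self._lock = threading.Lock()

    @property
    def version(self):
        # The version is simply the number of events appended so far.
        return len(self._events)

    def append(self, event, expected_version):
        # Check and append under one lock so the compare-and-append is atomic.
        with self._lock:
            if expected_version != len(self._events):
                raise VersionConflict(
                    f"expected version {expected_version}, stream is at {len(self._events)}"
                )
            self._events.append(event)
            return len(self._events)

stream = EventStream()
stream.append({"type": "stocked", "item": "widget", "qty": 1}, expected_version=0)

# Two customers both validate against the same view of the stream.
seen_by_alice = stream.version
seen_by_bob = stream.version

stream.append({"type": "checked_out", "item": "widget", "by": "alice"}, seen_by_alice)
try:
    stream.append({"type": "checked_out", "item": "widget", "by": "bob"}, seen_by_bob)
    bob_got_item = True
except VersionConflict:
    bob_got_item = False  # Bob must re-read the stream and re-validate
```

The second writer fails instead of double-selling, because its validation was done against a version the stream has since moved past.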

I'd really love for this to be formalized: a standard format for streaming the data through several systems and keeping them eventually consistent.
> Accumulating more stored data doesn’t make it slower

That is a valid theory when we talk about readers which look at recent data or when you are trying to append data to the existing system.

But in practice, the accumulation of cold data on a local disk is where this starts to hurt, particularly if that has to serve read traffic which starts from the beginning of time (i.e. your queries don't start with a timestamp range).

KSQL transforms do help reduce the depth of the traversal, by building flatter versions of the data set, but you need to repartition the same data on every lookup key you want - so if you had a video game log trace, you'd need multiple materializations for (user), (user, game), (game), etc.

And on this local storage part, EBS is expensive just to hold cold data, and then you replicate it on top of that to maintain availability during a node recovery - EBS is like a 1.5x redundant store, better than a single node. I liked the Druid segment model of shoving it off to S3 and still being able to read off it (i.e. not just streaming to S3 as a dumping ground).
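
To make the cost concrete, a toy sketch in plain Python (not KSQL; the `materialize` helper and the sample trace are made up for illustration): each lookup key needs its own pass over the log and its own copy of the data.

```python
from collections import defaultdict

def materialize(events, key_fields):
    # Build one view of the log, grouped by the given lookup key.
    view = defaultdict(list)
    for event in events:
        key = tuple(event[field] for field in key_fields)
        view[key].append(event)
    return dict(view)

# Hypothetical video game log trace.
trace = [
    {"user": "ann", "game": "chess", "score": 10},
    {"user": "ann", "game": "go", "score": 7},
    {"user": "bob", "game": "chess", "score": 3},
]

# One materialization per lookup key; each one holds every event again.
by_user = materialize(trace, ("user",))
by_user_game = materialize(trace, ("user", "game"))
by_game = materialize(trace, ("game",))
```

Three lookup keys means the same three events are stored three times over, which is the storage multiplication being described.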

When Pravega came out, I liked it a lot for the same reason - but it hasn't gained enough traction.

Log compaction definitely isn’t problem free. I’d say it was crazy when he wrote that.

We ran a cluster with lots of compacted topics, hundreds of terabytes of data. At the time it would make broker startup insanely slow. An unclean startup could literally take an hour to go through all the compacted partitions. It was awful.

I recommend his book, "I heart logs". It's a short read, but changed my perspective.
OT: Are there any Kafka alternatives in Rust or Go or C? (for a non-JVM stack)
*Kreps, not Krebs
Yes sorry! Unfortunately too late to edit :(
Agreed. I feel like the tech community is afflicted with collective functional fixedness, or some sort of essentialism.

At its core, it's electron state in hardware. So long as those limits are not incidentally exceeded, and you validate outputs, who really cares what gets loaded?

While we rip rare minerals from the ground and toss all that at scale every 3-5 years, we get economical over installing software.

So long as it offers the necessary order of operations to do the work, whatever.

https://en.m.wikipedia.org/wiki/Functional_fixedness