| HN Mirror


by lobster_johnson 4067 days ago

Kafka does indeed have a good design. But it doesn't solve the potential transaction gap between your store and the queue.

For example, if you commit a transaction but you're unable to reach the Kafka queue (because you crash, you're SIGTERMed, there's heavy load causing a network blip, or any number of other reasons), you'll lose updates. You can't very well write to Kafka before you commit either, because the change isn't visible outside the transaction yet.

The only way is to use a transaction log in the same database, in a way that lets the log be read after the commit is done. Logical streaming would let you do this (Bottled Water [1], as someone else here mentioned, does this with Kafka) in a safe way. It's conceptually identical to storing a transaction log table, but wouldn't require as much custom code, and you'd get incremental updates for free.

[1] http://blog.confluent.io/2015/04/23/bottled-water-real-time-...

ComNik 4067 days ago

Yes, I fully recognize the problem with double-writing. I will definitely try out Bottled Water. I was also thinking about replacing Kafka with a much simpler, lower-throughput system (because we are lightyears from LinkedIn's requirements).

Two reasons why I can't just use postgres (I'd love to): 1.) Kafka (or whatever queue we settle on) will be used for logs and metrics as well, data that doesn't flow through postgres.

2.) Postgres stores the data-model of my business-domain, at the lowest, normalized level. But derived data-stores are inherently denormalized and I want to be able to use them without talking back to my source-of-truth all the time. So currently I'm passing DTOs to Kafka, just like I would to any API request. This data is not easily available at the postgres-level.

I'm not yet sure on the right abstraction level for events. It seems very natural to have them contain information that I would send to clients directly.

lobster_johnson 4067 days ago

So what's your "source of truth"?

We have an application that might be similar. It receives analytics events from frontends. It uses (currently) RabbitMQ to distribute them to multiple "sinks", including InfluxDB, ElasticSearch and websockets; the main sink is one that stores the events as flat files (one JSON hash per line) in S3. That's what we consider our master data.
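A rough sketch of that fan-out shape, under loud assumptions: the class and function names (`JsonLinesSink`, `ListSink`, `fan_out`) are made up for illustration, an in-memory text stream stands in for the S3 file, and a plain loop stands in for the message broker.

```python
import io
import json


class JsonLinesSink:
    """Appends each event as one JSON object per line to a text stream."""

    def __init__(self, stream):
        self.stream = stream

    def write(self, event):
        # sort_keys keeps the line format stable across runs.
        self.stream.write(json.dumps(event, sort_keys=True) + "\n")


class ListSink:
    """Placeholder for any other sink (search index, metrics store, ...)."""

    def __init__(self):
        self.events = []

    def write(self, event):
        self.events.append(event)


def fan_out(event, sinks):
    # Deliver the same event to every sink, in order.
    for sink in sinks:
        sink.write(event)


master = JsonLinesSink(io.StringIO())
other = ListSink()
fan_out({"type": "click", "user": 7}, [master, other])
```

Because each line of the master file is a complete JSON document, any other sink can be rebuilt by replaying the file line by line.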

ComNik 4067 days ago

For all application-data events I consider postgres to be the ground-truth. That is somewhat unfortunate, because one can't easily place a queue in front of the database. For metrics and logs, the Kafka topic itself (which is persisted similarly to your flat files) would become the master. The use-case is pretty similar.

Might it be feasible to have something like postgres work with an external WAL? That would solve the problem I guess, as well as leave us with a single "persistent" system.