| HN Mirror


by ComNik 4059 days ago

Thank you for your detailed thoughts. You obviously have much more practical experience with this kind of system.

I am aware of many of the problems you mentioned, and have no workable solution for them yet. Detecting lost messages is the biggest one; Merkle trees sound like a very interesting approach, maybe even applied at the log level?
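One way to read the Merkle tree idea at the log level: producer and consumer each hash the batch of messages they believe they have, and compare only the root. A minimal sketch, assuming batches are small enough to hash in memory; `merkle_root` and the sample messages are hypothetical and not taken from any particular system.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(messages: list[bytes]) -> bytes:
    """Return a Merkle root over a batch of messages (illustrative helper)."""
    if not messages:
        return _h(b"")
    # Leaf and inner hashes use different prefixes so they cannot be confused.
    level = [_h(b"\x00" + m) for m in messages]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # simplification: duplicate the last node
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


sent = [b"event-1", b"event-2", b"event-3", b"event-4"]
received = [b"event-1", b"event-2", b"event-4"]  # event-3 was lost

# Comparing one 32-byte root per batch is enough to detect the gap.
batch_damaged = merkle_root(sent) != merkle_root(received)
```

When the roots differ, the two sides can compare subtree hashes to narrow down which range of the log diverged instead of re-sending the whole batch.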

As mentioned in another reply, Kafka does support the kind of "pointer-to-log" setup you mention. Also, Kafka is designed for lots of consumers, each with different characteristics. In principle, I should be able to sync something like memcache with the same information I need to sync Elasticsearch. The same holds for a websocket-server that reads from this stream and forwards new events to web-app clients. So I don't see the need for more than one "queue" yet, though maybe that will show up in practice.
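To illustrate the "one log, many consumers" idea, here is a minimal in-memory sketch in which each consumer keeps its own offset into a shared append-only log. It does not use the Kafka client API; `Log` and `Consumer` are hypothetical stand-ins for a topic partition and a consumer.

```python
class Log:
    """Append-only in-memory log (stand-in for a single topic partition)."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        self._entries.append(event)
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset):
        return self._entries[offset:]


class Consumer:
    """Reads the shared log at its own pace, tracking its own offset."""

    def __init__(self, log, handler):
        self.log = log
        self.handler = handler
        self.offset = 0

    def poll(self):
        events = self.log.read_from(self.offset)
        for event in events:
            self.handler(event)
            self.offset += 1  # advance only after the handler returned
        return len(events)


log = Log()
cache = {}   # plays the role of memcache
index = []   # plays the role of a search index

cache_consumer = Consumer(log, lambda e: cache.update({e["id"]: e["name"]}))
search_consumer = Consumer(log, index.append)

log.append({"id": 1, "name": "alice"})
cache_consumer.poll()                   # the cache keeps up in real time
log.append({"id": 2, "name": "bob"})
cache_consumer.poll()
search_consumer.poll()                  # the slower consumer catches up later
```

Because the log itself is never mutated by reads, a slow or newly added consumer sees the same events as a fast one, just later.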

Also your setup would require a lot more coordination to handle updates from multiple postgres instances, if I understood correctly.

That being said, I'm still in the experimental phase with all of this, and I will publish a writeup once I gain a bit more experience.

1 comments

lobster_johnson 4059 days ago

Kafka does indeed have a good design. But it doesn't solve the potential transaction gap between your store and the queue.

For example, if you commit a transaction but you're unable to reach the Kafka queue (because you crash, you're SIGTERMed, or there's heavy load causing a network blip, or any number of other reasons), you'll lose updates. You can't very well write to Kafka before you commit either, because the data isn't visible outside the transaction yet, and the transaction may still roll back.

The only way around this is to use a transaction log in the same database, in a way that lets the log be read after the commit is done. Logical streaming would let you do this (Bottled Water [1], as someone else here mentioned, does this with Kafka) in a safe way. It's conceptually identical to storing a transaction log table, but wouldn't require as much custom code, and you'd get incremental updates for free.

[1] http://blog.confluent.io/2015/04/23/bottled-water-real-time-...
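The transaction log table variant can be sketched as follows. This is a minimal illustration, assuming SQLite as a stand-in for Postgres; the `users` and `txlog` tables and the `create_user` and `relay` helpers are hypothetical names, and `publish` stands for whatever sends the event to the queue.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE txlog (seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)
conn.commit()


def create_user(conn, name):
    """Write the row and its log entry in the same transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        event = {"type": "user_created", "id": cur.lastrowid, "name": name}
        conn.execute(
            "INSERT INTO txlog (payload) VALUES (?)", (json.dumps(event),)
        )


def relay(conn, last_seq, publish):
    """Publish committed log entries after last_seq; return the new position."""
    rows = conn.execute(
        "SELECT seq, payload FROM txlog WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))
        last_seq = seq
    return last_seq


published = []  # stand-in for the queue
create_user(conn, "alice")
create_user(conn, "bob")
last = relay(conn, 0, published.append)
```

If the relay crashes after publishing but before its position is saved, it will publish the same entries again on restart, so this gives at-least-once delivery and consumers need to tolerate duplicates.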


ComNik 4059 days ago

Yes, I fully recognize the problem with double-writing. I will definitely try out Bottled Water. I was also thinking about replacing Kafka with a much simpler, lower-throughput system (because we are lightyears from LinkedIn's requirements).

Two reasons why I can't just use postgres (I'd love to):

1.) Kafka (or whatever queue we settle on) will be used for logs and metrics as well, data that doesn't flow through postgres.

2.) Postgres stores the data-model of my business-domain, at the lowest, normalized level. But derived data-stores are inherently denormalized and I want to be able to use them without talking back to my source-of-truth all the time. So currently I'm passing DTOs to Kafka, just like I would to any API request. This data is not easily available at the postgres-level.

I'm not yet sure on the right abstraction level for events. It seems very natural to have them contain information that I would send to clients directly.


lobster_johnson 4059 days ago

So what's your "source of truth"?

We have an application that might be similar. It receives analytics events from frontends. It uses (currently) RabbitMQ to distribute them to multiple "sinks", including InfluxDB, Elasticsearch and websockets; the main sink is one that stores the events as flat files (one JSON hash per line) in S3. That's what we consider our master data.
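A minimal sketch of that shape, assuming an in-memory stream in place of S3 and a plain function call in place of RabbitMQ; `JsonLinesSink`, `ListSink`, `distribute` and `replay` are hypothetical names, not the actual application's code.

```python
import io
import json


class JsonLinesSink:
    """Master sink: appends one JSON object per line to a text stream."""

    def __init__(self, stream):
        self.stream = stream

    def write(self, event):
        self.stream.write(json.dumps(event, sort_keys=True) + "\n")


class ListSink:
    """Stand-in for a derived sink such as a search index or websocket."""

    def __init__(self):
        self.events = []

    def write(self, event):
        self.events.append(event)


def distribute(event, sinks):
    """Fan a single event out to every sink."""
    for sink in sinks:
        sink.write(event)


def replay(stream):
    """Rebuild the event list from the flat-file master data."""
    return [json.loads(line) for line in stream if line.strip()]


master = io.StringIO()
live = ListSink()
sinks = [JsonLinesSink(master), live]

distribute({"type": "page_view", "path": "/"}, sinks)
distribute({"type": "click", "target": "signup"}, sinks)

master.seek(0)
rebuilt = replay(master)  # a derived sink can be rebuilt from the master file
```

Treating the flat file as master data means any derived sink can be dropped and rebuilt by replaying the file.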


ComNik 4059 days ago

For all application-data events I consider postgres to be the ground truth. That is somewhat unfortunate, because one can't easily place a queue in front of the database. For metrics and logs, the Kafka topic itself (which is persisted similarly to your flat files) would become the master. The use case is pretty similar.

Might it be feasible to have something like postgres work with an external WAL? That would solve the problem I guess, as well as leave us with a single "persistent" system.
