Hacker News new | ask | show | jobs
by lbradstreet 2951 days ago
PubSub is fundamentally a different abstraction than Kafka which is log orientated. With Kafka you can maintain the ability to replay your log/stream, in order, at any time.

This enables different kinds of use cases, and can be easier to reason about.

1 comments

Super interesting, I admit I haven't spent much time understanding Kafka---it seems like it's almost a hybrid between a message queue and a database?
Kafka allows clean decoupling of event producing and consuming logic unlike typical message queues. Your clients can keep pushing new messages to Kafka's persistent logs without having to know who and what will process those messages.

This makes it possible to easily update your event processing logic or add new components to the mix while maintaining a clean architecture.

I gave a talk about this at PostgresConf US a few weeks ago, the talk's not yet available as an article, but you can get the slides from https://aiven.io/blog/aiven-talks-in-postgresconf-us-2018/ if you're interested.

Yes. Kafka is a horizontally-scalable persistent queue (among other things). This sounds kinda pedestrian (like something out of a computer science textbook), but people are rapidly discovering uses for it as a systems architecture component.

The concept of "stream-table duality" tells us tables can be thought of as a materialized view of change data operations [INSERT/UPDATE/DELETE] streams. Kafka can be used as a buffer for streaming data that can materialized into a relational table at any time.

One of the more interesting use-cases is multi-target replication: feed your change-data-capture data into Kafka, and replay it on any other backend data store (SQL, Graph Database, NoSQL, etc.)[1]

Conceptually this lets you ingest data into a stream, write the data on multiple backends and keep everything in sync. Martin Klepmann has written a simple (PoC-quality) tool for doing this with Postgres databases.

[1] https://www.confluent.io/blog/bottled-water-real-time-integr...

Yes, I like to think of it as as the transaction log without any of the logic. If you’re interested, I can highly recommend Martin Kleppmann’s talk “Turning the database inside out” https://www.youtube.com/watch?v=fU9hR3kiOK0
I thought you used it at Fivetran? It's a logging system where you send messages as a key/value of bytes to a topic, which is just a logical group of partition files spread over the nodes as master + replicas.

Consumers read from the topic (the underlying partitions) and maintain the offset last read themselves, allowing for easy replay, strict ordering within a single partition, and completely disconnected pacing from producers. The optional "key" for each message allows for compaction so only the latest key/value pair is kept in a topic.

It's definitely not a database but works well for replicating them, as well as doing most of the work of a typical message queue / service bus system.

Nope, no Kafka at Fivetran, nearly all the sources we pull from are intrinsically batch, support retry, and come out of the source already partitioned by user. Our internal architecture is basically just a farm of batch processors.

We have been hearing more and more about people using Kafka to support streaming analytics. I haven't spent a ton of time studying it, I was under the impression that it was just a giant queue, but ordering + retention is basically a database in the way I think about it.

If consumers need to keep track of their own offsets, do you run into complications when trying to run multiple instances of the same consumer? Or do you typically run single processes when consuming a specific topic?
This is what partitions are for. They are used to scale out both publishing and consumption by assigning each consumer a single partition. This logic used to be complicated client-side code but now is included into Kafka itself, along with automatic offset saving on some interval (saved as entries in another topic).

Also consumers can be in a "consumer group" so you can have multiple clusters of consumers each reading the entire log separately, but shared within each cluster.

Also I'd recommend looking at Apache Pulsar for a next-generation architecture that combines Kafka's log semantics with the low-latency routing and individual message acknowledgements of message queues: https://pulsar.incubator.apache.org/

> it seems like it's almost a hybrid between a message queue and a database?

If Kafka supported data processing (queries or whatever) then it would be much closer to databases. Also, databases are normally aware of the structure of the data (for example, columns).

Therefore, Kafka can hardly be viewed as a DBMS because it explicitly separates two major concerns:

* data management - how to represent data (Kafka)

* data processing - how to derive/infer new data (Kafka Streams - a separate library)

Theoretically, if they could combine these two layers of functionality in one system then it would be a database.

For me, the best way to think about Kafka is as a distributed log. That said, it does feature rather robust transformations of log entries - does KSQL count as ”queries or whatever”? https://www.confluent.io/product/ksql/
As far as I understand, KSQL is not integral part of Kafka - it is based on Kafka Streams (an independent library). So Kafka is not one system and hence some ambiguity when referring to its functionalities. In particular, "Kafka is as a distributed log" means only Kafka core, not Kafka Streams and KSQL.
love this article about the log abstraction and some of the motivation behind Kafka. It's a great resource to get a feel for what it is and how you can use it:

https://engineering.linkedin.com/distributed-systems/log-wha...