Hacker News
by max_streese 1949 days ago
Hi, not sure if I am completely off here, but I am wondering how this relates or compares to processing things with Kafka and Kafka Streams.

If I am reading things correctly, the Kafka workflow equivalent to what's written in the article would be to have your producer write to some topic using hash-based partitioning (the default for keyed records) on the key you are interested in. Your consumer would then just read it, and your data would already be sorted for the given keys (because within a partition Kafka has sorting guarantees). It would also be co-partitioned correctly if you need to read in some other topic with the same number of partitions and the same logical keys, produced via the same algorithm. No?
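
To make the co-partitioning idea concrete, here is a minimal sketch in plain Python, not real Kafka client code. It assumes two topics with the same partition count and the same key-to-partition function. `partition_for`, `produce`, and the sample records are hypothetical, and `zlib.crc32` is only a stand-in for the hash Kafka's default partitioner actually uses (murmur2 over the serialized key).

```python
import zlib

NUM_PARTITIONS = 4  # assumption: both topics have the same partition count


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash. Kafka's default partitioner uses murmur2 on the key bytes;
    # crc32 is used here only because it is deterministic and in the stdlib.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(records, num_partitions):
    # Model a topic as a list of partitions, each an append-only list.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical sample data keyed by user id.
orders = [("user-1", "order-a"), ("user-2", "order-b"), ("user-1", "order-c")]
profiles = [("user-1", "Alice"), ("user-2", "Bob")]

orders_topic = produce(orders, NUM_PARTITIONS)
profiles_topic = produce(profiles, NUM_PARTITIONS)

# Records for the same key land at the same partition index in both topics,
# so a consumer assigned partition p of each topic can join them locally.
p = partition_for("user-1", NUM_PARTITIONS)
print(p, orders_topic[p], profiles_topic[p])
```

If the two topics had different partition counts or used different partitioners, the same key could land at different indexes and the local join would miss matches.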


This is the most basic pattern for distributed joins - you hash on the join key in both tables and shuffle data based on hash ranges. In some systems like Redshift you can designate the key for distribution so that "related" records are already co-located on a single shard.
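
A rough sketch of that pattern, in plain Python rather than any particular engine: both tables are shuffled to shards by a hash of the join key, then each shard is joined locally. The function names, the sample rows, and the use of `crc32` modulo the shard count (instead of hash ranges) are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def shard_for(key, num_shards):
    # Illustrative hash: any deterministic hash shared by both sides works.
    return zlib.crc32(str(key).encode("utf-8")) % num_shards


def shuffle(rows, key_index, num_shards):
    # Route every row to the shard chosen by the hash of its join key.
    shards = defaultdict(list)
    for row in rows:
        shards[shard_for(row[key_index], num_shards)].append(row)
    return shards


def hash_shuffle_join(left, right, left_key, right_key, num_shards=3):
    left_shards = shuffle(left, left_key, num_shards)
    right_shards = shuffle(right, right_key, num_shards)
    joined = []
    for shard_id in range(num_shards):
        # Local join: matching keys are guaranteed to be in the same shard.
        index = defaultdict(list)
        for row in right_shards.get(shard_id, []):
            index[row[right_key]].append(row)
        for row in left_shards.get(shard_id, []):
            for match in index.get(row[left_key], []):
                joined.append(row + match)
    return joined


# Hypothetical tables: users(id, name) and orders(order_id, user_id).
users = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
orders = [(101, 1), (102, 3), (103, 1)]

result = hash_shuffle_join(orders, users, left_key=1, right_key=0)
print(sorted(result))
```

When a table is already distributed on the join key, the shuffle step for that table is unnecessary, which is the point of choosing a distribution key up front.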

> your data would already be sorted for the given keys (because within a partition Kafka has sorting guarantees)

It's been a while since I used Kafka, but I don't remember any "sorting guarantees". Consumers see events "in order" based on when they were produced, because each partition is a queue.
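
The difference between "in order" and "sorted" can be shown with a toy model of one partition as an append-only list. This is a sketch, not Kafka code; `append` and the sample events are hypothetical.

```python
def append(log, key, value):
    # The offset is simply the position at which the record was appended.
    offset = len(log)
    log.append((offset, key, value))
    return offset


partition = []  # toy model of a single partition
for key, value in [("b", 1), ("a", 2), ("b", 3), ("a", 4)]:
    append(partition, key, value)

# A consumer reads by increasing offset, i.e. in append order.
consumed_keys = [key for _, key, _ in partition]

# Per key, values keep the order in which they were produced...
values_for_b = [value for _, key, value in partition if key == "b"]

# ...but the partition as a whole is not sorted by key.
is_sorted_by_key = consumed_keys == sorted(consumed_keys)
print(consumed_keys, values_for_b, is_sorted_by_key)
```

So events for one key arrive in the order they were produced, but keys that share a partition are interleaved rather than grouped or sorted.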

Yes, I guess my point is that when you use Kafka together with Kafka Streams and produce data partitioned the way you need it for consumption, you do not need any shuffling when you want to join, because the data is already partitioned correctly.

You seem to know what you're talking about. Any recommendations on learning resources for this type of flow? Or for really understanding which platform works best in each situation?

I'm learning about proper real-time data flow as I look to move the ETL of product data into Postgres over to a more suitable system.

Finding the right learning resources is difficult! Cheers.