Hacker News
by jsjsbdkj 1960 days ago
This is the most basic pattern for distributed joins - you hash on the join key in both tables and shuffle data based on hash ranges. In some systems like Redshift you can designate the key for distribution so that "related" records are already co-located on a single shard.
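
For anyone who hasn't seen it spelled out, here is a rough sketch of that pattern in plain Python. Everything in it is made up for illustration (`shard_for`, `shuffle`, `distributed_join`, the sample rows), and it assigns shards with a simple modulo rather than real hash ranges, so treat it as a toy under those assumptions and not how any particular engine does it.

```python
import zlib
from collections import defaultdict

def shard_for(key, num_shards):
    # Stable hash so both tables agree on the shard for a given key.
    # (Modulo is an assumption here; real systems often use hash ranges.)
    return zlib.crc32(str(key).encode("utf-8")) % num_shards

def shuffle(rows, key_field, num_shards):
    # Route every row to the shard that owns its join key.
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(row[key_field], num_shards)].append(row)
    return shards

def local_hash_join(left_rows, right_rows, key_field):
    # Ordinary in-memory hash join, run independently on each shard.
    index = defaultdict(list)
    for row in left_rows:
        index[row[key_field]].append(row)
    joined = []
    for right in right_rows:
        for left in index.get(right[key_field], []):
            joined.append({**left, **right})
    return joined

def distributed_join(left, right, key_field, num_shards=4):
    left_shards = shuffle(left, key_field, num_shards)
    right_shards = shuffle(right, key_field, num_shards)
    out = []
    # Matching keys always land on the same shard index, so each shard
    # only needs to look at its own slice of both tables.
    for left_part, right_part in zip(left_shards, right_shards):
        out.extend(local_hash_join(left_part, right_part, key_field))
    return out

users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
orders = [
    {"user_id": 1, "total": 30},
    {"user_id": 1, "total": 5},
    {"user_id": 3, "total": 9},
]
result = distributed_join(users, orders, "user_id")
```

If the tables were already distributed on `user_id` (the Redshift distribution-key case), the two `shuffle` calls would be unnecessary and only the per-shard join would remain.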

> our data would already be sorted for the given keys (because within a partition Kafka has sorting guarantees)

It's been a while since I used Kafka, but I don't remember any "sorting guarantees". Consumers see events in the order they were produced to a partition, because each partition is an append-only log. That is ordering by arrival, not sorting by key.
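
To make the distinction concrete, here is a tiny simulation, not real Kafka client code: a single partition is modelled as a Python list, and `produce` is a hypothetical helper that appends a record and returns its offset.

```python
partition = []  # one partition, modelled as an append-only list

def produce(key, value):
    # Append the record and return its offset within the partition.
    partition.append((key, value))
    return len(partition) - 1

for key, value in [("c", 1), ("a", 2), ("b", 3), ("a", 4)]:
    produce(key, value)

# A consumer reads by offset, so it sees keys in production order.
consumed_keys = [key for key, _ in partition]

# Production order is generally not key order.
is_sorted_by_key = consumed_keys == sorted(consumed_keys)
```

Here `consumed_keys` comes back as `c, a, b, a`, so a merge join that assumes key-sorted input would need an explicit sort first.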

Yes, I guess my point is that when you use Kafka together with Kafka Streams and produce records partitioned the way you need them for consumption, you don't need to do any shuffling when you want to join, because the data is already partitioned correctly.

You seem to know what you're talking about. Any recommendations on learning resources for this type of flow? Or for really understanding which platform works best in each situation?

I'm learning how to do real-time data flow properly as I look to move our ETL of product data into Postgres over to a more suitable system.

Finding the right learning resources is difficult! Cheers.