Hacker News new | ask | show | jobs
by kwillets 623 days ago
One additional thought regarding query performance is that content-defined row groups allow localized joins and aggregations which are much faster than the globally-shuffled kind.

If the sharding key matches (or is a subset of) a join or group-by key, then identical values are local to a single shard, which can be processed independently.

This type of thing is typically done at large granularity (eg one shard per MPP compute node), but there are also benefits down to the core or thread level.

Another tip is that if no shard key is defined, hash the whole row as a default.

1 comments

Strike the sharding idea as I misunderstood the CDC method.