| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by quapster 152 days ago

Joins are where the abstraction leak between “relational algebra” and “physics of the cluster” becomes impossible to ignore.

On paper, join order is a combinatorial search over equivalent expressions. In reality, you’re optimizing over three very non-relational constraints: data distribution (where bytes actually live), cardinality estimation (how wrong your stats are), and memory/network contention (what everyone else is running). That’s why so many OLAP setups quietly give up and denormalize: not because joins are conceptually hard, but because getting good enough plans under bad stats and skewed data is brutally hard and very user-visible when it fails.

What’s interesting about systems like StarRocks, ClickHouse, DuckDB, etc is that they’re implicitly making a bet: “we can push the optimizer and execution engine far enough that normalized schemas become operationally cheaper than the hacks (wide tables, pre-joined materializations, bespoke streaming DAGs).” If that bet holds, the real win isn’t just faster joins, it’s shifting complexity back from application-specific pipelines into a general-purpose optimizer that can be improved once and benefit everyone.

The irony is that the more powerful the optimizer, the more your “logical” schema becomes a performance API surface. A small change in constraints, stats collection, or distribution keys can be worth more than any new feature, but it’s also harder to reason about than “this table is pre-joined.” So we’re trading one kind of complexity (manual denormalization and backfills) for another (making the cost model and distribution-aware planner smart enough to not shoot you in the foot).