Hacker News new | ask | show | jobs
by davidrowley 856 days ago
There certainly are valid reasons for this. For example, adding a join condition with an OR clause. The only join operator that supports non-equi joins is Nested Loop. If you went from a Hash or Merge join to that, then you'd likely notice some performance degradation. If you have a link to anywhere you've asked for help on this, then I'd be interested to see more details.
1 comments

I see you’re definitely familiar with the space. I think the condition was using ‘or array_contains’ in the join.

This was really the only resource I found acknowledging it. It sounds like it has do with presumption of nulls (e.g. spark can’t assume they won’t be there) but it would be great to be able to say “don’t worry spark I promise there are no nulls/if there are just disregard”) https://kb.databricks.com/sql/disable-broadcast-when-broadca...

The way we go around this feels so brutish. Literally just did two separate joins and then unioned the results. The recommendation to use ‘not exists’ couldn’t be applied as array_contains must be using ‘in’ under the hood and couldn’t be changed.

UNION is certainly one way to eliminate the OR condition.

In theory, a hash join is possible with a condition like `ON t1.a = t2.a OR t1.b = t2.b`, but Hash Join would need to build two hash tables and only probe the 2nd one if the first lookup found nothing. I don't know if there are any RDBMSs that allow multiple hash tables in a hash join.

Yeah it feels like for some arbitrarily sized array contains join having something like a bloom filter involved could help reduce the search space. But I’m not a db engineer so perhaps the specific implementation details of that would turn out to make it a bad idea.