Hacker News new | ask | show | jobs
by appplication 856 days ago
I see you’re definitely familiar with the space. I think the condition was using ‘or array_contains’ in the join.

This was really the only resource I found acknowledging it. It sounds like it has do with presumption of nulls (e.g. spark can’t assume they won’t be there) but it would be great to be able to say “don’t worry spark I promise there are no nulls/if there are just disregard”) https://kb.databricks.com/sql/disable-broadcast-when-broadca...

The way we go around this feels so brutish. Literally just did two separate joins and then unioned the results. The recommendation to use ‘not exists’ couldn’t be applied as array_contains must be using ‘in’ under the hood and couldn’t be changed.

1 comments

UNION is certainly one way to eliminate the OR condition.

In theory, a hash join is possible with a condition like `ON t1.a = t2.a OR t1.b = t2.b`, but Hash Join would need to build two hash tables and only probe the 2nd one if the first lookup found nothing. I don't know if there are any RDBMSs that allow multiple hash tables in a hash join.

Yeah it feels like for some arbitrarily sized array contains join having something like a bloom filter involved could help reduce the search space. But I’m not a db engineer so perhaps the specific implementation details of that would turn out to make it a bad idea.