|
|
|
|
|
by appplication
856 days ago
|
|
I see you’re definitely familiar with the space. I think the condition was using ‘or array_contains’ in the join. This was really the only resource I found acknowledging it. It sounds like it has do with presumption of nulls (e.g. spark can’t assume they won’t be there) but it would be great to be able to say “don’t worry spark I promise there are no nulls/if there are just disregard”)
https://kb.databricks.com/sql/disable-broadcast-when-broadca... The way we go around this feels so brutish. Literally just did two separate joins and then unioned the results. The recommendation to use ‘not exists’ couldn’t be applied as array_contains must be using ‘in’ under the hood and couldn’t be changed. |
|
In theory, a hash join is possible with a condition like `ON t1.a = t2.a OR t1.b = t2.b`, but Hash Join would need to build two hash tables and only probe the 2nd one if the first lookup found nothing. I don't know if there are any RDBMSs that allow multiple hash tables in a hash join.