|
|
|
|
|
by davidrowley
856 days ago
|
|
There certainly are valid reasons for this. For example, adding a join condition with an OR clause. The only join operator that supports non-equi joins is Nested Loop. If you went from a Hash or Merge join to that, then you'd likely notice some performance degradation. If you have a link to anywhere you've asked for help on this, then I'd be interested to see more details. |
|
This was really the only resource I found acknowledging it. It sounds like it has do with presumption of nulls (e.g. spark can’t assume they won’t be there) but it would be great to be able to say “don’t worry spark I promise there are no nulls/if there are just disregard”) https://kb.databricks.com/sql/disable-broadcast-when-broadca...
The way we go around this feels so brutish. Literally just did two separate joins and then unioned the results. The recommendation to use ‘not exists’ couldn’t be applied as array_contains must be using ‘in’ under the hood and couldn’t be changed.