Hacker News new | ask | show | jobs
by gopalv 1826 days ago
I've hit almost the exact same issue with Hive, with a somewhat temporary workaround (like this post) to build a balanced tree out of this by reading it into a list [1] and rebuilding a binary balanced tree out of it.

But we ended up implementing a single level Multi-AND [2] so that this no longer a tree for just AND expressions & can be vectorized neater than the nested structure with a function call for each (this looks more like a tail-call rather than a recursive function).

The ORC CNF conversion has a similar massively exponential item inside which is protected by a check for 256 items or less[3].

[1] - https://github.com/t3rmin4t0r/captain-hook/blob/master/src/m...

[2] - https://issues.apache.org/jira/browse/HIVE-11398

[3] - https://github.com/apache/hive/blob/master/storage-api/src/j...

1 comments

It seems to me that the basic problem is that binary trees are the wrong data type. For instance, you can transform the tree to balance it:

    p1 AND (p2 AND (p3 AND notp4)) -> (p1 AND p2) AND (p3 AND notp4)
But the abstraction is specifying the order of operations unnecessarily. Using general trees, I think you avoid the need to transform the order of operations and "NOT" doesn't have to be a special case.

    ALL (p1 p2 p3 NOT(p4))
Is there any reason to choose binary trees for this? (Other than inertia).
> the abstraction is specifying the order of operations unnecessarily

That needs an "in SQL", the standard imperative language operator ordering has short-cut operations in there (a is null or a.value == true) etc.

In the code I work with, this actually sorts the conditions based on estimated selectivity[1] and type (long compares to constant are cheaper on a columnar data-set due to the SIMD, but string isn't etc).

> Is there any reason to choose binary trees for this?

The parse-tree does come off as binary because inserting logical parentheses makes it easier to tackle, because there are association rules which neatly go into a BinaryOp structure when dealing with operator precedence in parsing.

So it is easier to handle the parsing when you treat (a + (b + c) ) and (a / (b / c)) in similar fashion.

I won't make the same mistake again if I have to build a SQL engine, but this actually made the logical expression match the parse tree very closely and was a good enough generalization until the traversal time bugs[2] started to pop up.

[1] - https://github.com/apache/hive/blob/master/common/src/java/o... [2] - https://issues.apache.org/jira/browse/HIVE-9166