| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by gopalv 1826 days ago

I've hit almost the exact same issue with Hive, with a somewhat temporary workaround (like this post) to build a balanced tree out of this by reading it into a list [1] and rebuilding a binary balanced tree out of it.

But we ended up implementing a single level Multi-AND [2] so that this no longer a tree for just AND expressions & can be vectorized neater than the nested structure with a function call for each (this looks more like a tail-call rather than a recursive function).

The ORC CNF conversion has a similar massively exponential item inside which is protected by a check for 256 items or less[3].

[1] - https://github.com/t3rmin4t0r/captain-hook/blob/master/src/m...

[2] - https://issues.apache.org/jira/browse/HIVE-11398

[3] - https://github.com/apache/hive/blob/master/storage-api/src/j...

1 comments

crznp 1826 days ago

It seems to me that the basic problem is that binary trees are the wrong data type. For instance, you can transform the tree to balance it:

    p1 AND (p2 AND (p3 AND notp4)) -> (p1 AND p2) AND (p3 AND notp4)

But the abstraction is specifying the order of operations unnecessarily. Using general trees, I think you avoid the need to transform the order of operations and "NOT" doesn't have to be a special case.

    ALL (p1 p2 p3 NOT(p4))

Is there any reason to choose binary trees for this? (Other than inertia).

gopalv 1825 days ago

> the abstraction is specifying the order of operations unnecessarily

That needs an "in SQL", the standard imperative language operator ordering has short-cut operations in there (a is null or a.value == true) etc.

In the code I work with, this actually sorts the conditions based on estimated selectivity[1] and type (long compares to constant are cheaper on a columnar data-set due to the SIMD, but string isn't etc).

> Is there any reason to choose binary trees for this?

The parse-tree does come off as binary because inserting logical parentheses makes it easier to tackle, because there are association rules which neatly go into a BinaryOp structure when dealing with operator precedence in parsing.

So it is easier to handle the parsing when you treat (a + (b + c) ) and (a / (b / c)) in similar fashion.

I won't make the same mistake again if I have to build a SQL engine, but this actually made the logical expression match the parse tree very closely and was a good enough generalization until the traversal time bugs[2] started to pop up.

[1] - https://github.com/apache/hive/blob/master/common/src/java/o... [2] - https://issues.apache.org/jira/browse/HIVE-9166