Good question, the simplified example doesn't make this clear.
The real implementation has a mutable `builder` argument used to gradually build the converted filter. If we perform the `transform().isDefined` call directly on the "main" builder, but the subtree turns out to not be convertible, we can mess up the state of the builder.
The second example from the post would look roughly like this:
val transformedLeft = if (transform(tree.left, new Builder()).isDefined) {
transform(tree.left, mainBuilder)
} else None
Since the two `transform` invocations are different, we can't cache the result this way.
It looks like transform(tree.left) returns an Option[Tree] already (otherwise the code would not type check) so the entire if-else in the original code seems redundant and could be replaced with:
Just responded to the parent comment as well - there's an additional mutable argument to the real `transform` method so it's unsafe to invoke it directly without first checking if the tree is convertible.
It boggles my mind that the author wrote an entire long article based on this.
The rhetorical question saying that surely that weird refactor of two different functions into one, followed by calling that new, non-trivial function twice for no reason surely shouldn't affect performance.. He already lost me during the premise of the article.
What is so hard to understand here? There is some library code you can't immediately change because it belongs to upstream Spark. To illustrate the problem, ne simplifies the code to represent what the problem is.
Then, ne writes some code that works around the library bug by modifying the input losslessly into something that's more easily processed by the library.
Finally, ne patches the library bug and shares the patch.
All of this is also kinda fucking obvious to not just me, but a lot of people, so I'm having a really hard time grasping if you've mixed up the illustrative simplification with the actual code, or if you think that the best engineering approach is to always patch your environment bugs instead of modifying your input, or if you just don't have a Github account or for some other reason can't read the patch.
The real implementation has a mutable `builder` argument used to gradually build the converted filter. If we perform the `transform().isDefined` call directly on the "main" builder, but the subtree turns out to not be convertible, we can mess up the state of the builder.
The second example from the post would look roughly like this:
Since the two `transform` invocations are different, we can't cache the result this way.There's a more detailed explanation in the old comment to the method: https://github.com/apache/spark/pull/24068/files#diff-5de773... .