| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mlyle 1824 days ago

Why not just...

  val transformedLeftTemp = transform(tree.left)

  val transformedLeft = if (transformedLeftTemp.isDefined) {
    transformedLeftTemp
  } else None

3 comments

IvanVergiliev 1824 days ago

Good question, the simplified example doesn't make this clear.

The real implementation has a mutable `builder` argument used to gradually build the converted filter. If we perform the `transform().isDefined` call directly on the "main" builder, but the subtree turns out to not be convertible, we can mess up the state of the builder.

The second example from the post would look roughly like this:

  val transformedLeft = if (transform(tree.left, new Builder()).isDefined) {
    transform(tree.left, mainBuilder)
  } else None

Since the two `transform` invocations are different, we can't cache the result this way.

There's a more detailed explanation in the old comment to the method: https://github.com/apache/spark/pull/24068/files#diff-5de773... .

link

lanna 1824 days ago

It looks like transform(tree.left) returns an Option[Tree] already (otherwise the code would not type check) so the entire if-else in the original code seems redundant and could be replaced with:

    val transformedLeft = transform(tree.left)

link

IvanVergiliev 1824 days ago

Just responded to the parent comment as well - there's an additional mutable argument to the real `transform` method so it's unsafe to invoke it directly without first checking if the tree is convertible.

link

eska 1824 days ago

It boggles my mind that the author wrote an entire long article based on this.

The rhetorical question saying that surely that weird refactor of two different functions into one, followed by calling that new, non-trivial function twice for no reason surely shouldn't affect performance.. He already lost me during the premise of the article.

link

renewiltord 1824 days ago

What is so hard to understand here? There is some library code you can't immediately change because it belongs to upstream Spark. To illustrate the problem, ne simplifies the code to represent what the problem is.

Then, ne writes some code that works around the library bug by modifying the input losslessly into something that's more easily processed by the library.

Finally, ne patches the library bug and shares the patch.

All of this is also kinda fucking obvious to not just me, but a lot of people, so I'm having a really hard time grasping if you've mixed up the illustrative simplification with the actual code, or if you think that the best engineering approach is to always patch your environment bugs instead of modifying your input, or if you just don't have a Github account or for some other reason can't read the patch.

Between that patch and https://github.com/apache/spark/pull/24910 you can see why the code is what it is.

link