by krumbie 1603 days ago
That's true. However, I believe that many R programmers don't know when non-standard evaluation happens or what it is exactly. Functions with or without it cannot be told apart just by looking at the syntax.

While NSE enables the dplyr syntax that many people enjoy, for me it's too magic and I have trouble reasoning about variable names in other people's code.
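To make the "cannot be told apart" point concrete, here is a minimal base R sketch. Both function names are made up for illustration; the only difference is that one evaluates its argument and the other captures the expression with `substitute()`. The two call sites look identical.

```r
# Hypothetical helpers, named only for this illustration.
std_label <- function(x) x                        # standard evaluation: uses the value
nse_label <- function(x) deparse(substitute(x))   # NSE: captures the unevaluated expression

value <- 42

std_label(value)   # 42
nse_label(value)   # "value"
```

Nothing at the call site says which of the two you are dealing with; you have to read the function body or its documentation.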


What does dplyr syntax look like?
Let's say you have a data frame

    df = tibble(a = c(1, 2))
and you want to use a dplyr verb to modify it

    mutate(df, b = a + 1)
The `a` in the above expression refers to the column in `df`, which makes it hard to reference a variable named `a` in the outer scope. Furthermore, if you have a variable `a_var` holding the column name as a string, `"a"`, you can't simply write

    mutate(df, b = a_var + 1)
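As a rough sketch of why this happens, here is the same lookup behaviour reproduced in base R. `mask_eval` is a hypothetical helper written for this comment, not dplyr's implementation (dplyr's real machinery is more involved); it just evaluates a captured expression with the data frame's columns searched first and the caller's scope second.

```r
# Hypothetical helper: evaluate `expr` with the columns of `data` taking
# precedence over variables in the calling scope. Not dplyr's actual code.
mask_eval <- function(data, expr) {
  eval(substitute(expr), data, parent.frame())
}

df <- data.frame(a = c(1, 2))
a <- 100        # outer variable with the same name as the column
a_var <- "a"    # string holding the column name

mask_eval(df, a + 1)            # 2 3: the column wins over the outer `a`
mask_eval(df, a_var)            # "a": just the string, not the column
mask_eval(df, df[[a_var]] + 1)  # 2 3: explicit lookup by name works

# Using the string directly fails, because "a" + 1 is not valid arithmetic.
failed <- inherits(try(mask_eval(df, a_var + 1), silent = TRUE), "try-error")
failed                          # TRUE
```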
Contrast this with DataFramesMeta.jl, which is a dplyr-like library for Julia, written with macros.

    df = DataFrame(a = [1, 2])
    @transform df :b = :a .+ 1 
Because of the use of Symbols, there is no ambiguity about scopes. To work with a variable referring to column `a` you can write

    a_str = "a"
    @transform df :b = $a_str .+ 1
I won't pretend this isn't more complicated or harder to learn. Some of the complexity is due to Julia's high performance limiting non-standard evaluation in subtle ways. But a core strength of Julia's macros is that it's easy to inspect these expressions and understand exactly what's going on, with `@macroexpand` as shown in the blog post.

DataFramesMeta.jl repo: https://github.com/JuliaData/DataFramesMeta.jl

To reference variables in the outer scope, you would do

    mutate(df, b = .env$a + 1)
And if you have a string (contained in a_var) which identifies a variable you can do

    mutate(df, b = .data[[a_var]] + 1)
You could argue these feel clumsy, but I wouldn’t say it’s “hard” to do either of these things with dplyr.
I don't think it's just about whether it's hard to do. Your syntax example looks short enough, and one can memorize these two patterns relatively quickly.

However, both patterns are yet another special case of how identifiers are resolved in the expression. Aren't `.env` and `.data` both valid variable and column names? So what happens if I have a column named `.data`?

Another example, which is the reason why we chose the `:column` style to refer to columns in `DataFramesMeta.jl` and `DataFrameMacros.jl`:

What happens if you have the expression `mutate(df, b = log(a))`? Both `log` and `a` are symbols, but `log` is not treated as a column. Maybe that's because it's used in a function-like fashion? Maybe because R looks at the value of `log` and `a` in their scope and sees that `log` is a function and `a` isn't?

In Julia DataFrames, it's totally valid to have a column that stores different functions. With the dplyr-like syntax rules it would not be possible to express a function call with a function stored in a column, if the pattern really is that function-call syntax means a symbol is no longer looked up in the data frame.

In Julia DataFrameMacros.jl for example, if you had a column named `:func` you could do `@transform(df, :b = :func(:a))` and it would be clear that `:func` resolves to a column.

This particular example might seem like a niche problem, but it's just one of these tradeoffs that you have to make when overloading syntax with a different meaning. I personally like it if there's a small rule set which is then consistently applied. I'd argue that's not always the case with dplyr.

I hadn't thought of that tradeoff. After testing just now, if you have a column named `.data` or `.env` those constructs work as if there was no such column, and actually in that case `mutate(df, b = .data + 1)` is an error.

Personally I'll happily take not being able to use those as column names if it means I can avoid always typing : before every in-data variable, but your comment gave me a better understanding of why it would be bad for some other person or scenario, perhaps where short term ease-of-use is lower on the list of priorities.

For your second example, it doesn't come up in R because a data frame column cannot be a function. Columns must be vectors (including lists). You could have a vector where one or all elements are functions, but the column itself cannot be a function (functions are not vectors), so there's no ambiguity there. To call a function stored in your data frame you'd have to access an element of the column, and any access method, e.g. `[[` or `$`, would make the resulting set of characters invalid as the name of an object (without backticks, which would then disambiguate the intent).

    df <- tibble(x = list(function(x) x + 1))
    df %>% 
      mutate(y = x[[1]](3))
Separate from dplyr, in R when you use `(` to call a function it searches only for functions by that name.

    log <- 3
    log(1)
    # 0

    frog <- 3
    frog(3)
    # Error in frog(3) : could not find function "frog"
    
    log <- function(x) x^2
    log(1)
    # 1
In Julia you could have an `AbstractVector` type also be callable, or more likely a vector of callable objects (and the operation is performed row-wise).

I agree it's unlikely that a user will name their column `.data`. But it certainly saves developer effort from thinking about these issues.

The larger concern, really, is that Julia needs to know which things are columns and which things are variables in an expression at parse time in order to generate fast code for a DataFrame. It needs to do this without inspecting the data frame, since the data frame's contents aren't known at parse time.

One option would be to make all literals columns. But then you run into issues with things like `missing`, which would have to be escaped or not recognized as a column. It's hard to predict all the problems there, and any escaping rules would definitely have to be more complicated than R's. So we require `:` and take the easy way out, which has the added benefit of helping new users who might get confused about the variable-column distinction.

It would be interesting to profile the 2nd version though. Assuming the non-standard evaluation has performance benefits (which it does in DataFramesMeta.jl), are you eliminating those benefits when you use

    .data[[a_var]]

?
It's even better when you have the "." variable which gets populated.

But in general yeah, R plays pretty fast and loose with scopes, and lets you capture expressions as arguments and execute them in a different scope from the outside one
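A small base R sketch of that last point, assuming nothing beyond `substitute()` and `eval()`. The function name `capture_and_run` and the environment `other` are made up for this example.

```r
# Hypothetical helper: capture the argument without evaluating it, then
# evaluate it in whatever scope the caller hands over.
capture_and_run <- function(expr, scope) {
  captured <- substitute(expr)   # the expression itself, e.g. x + 1
  eval(captured, scope)          # evaluated in `scope`, not where it was written
}

x <- 1

other <- new.env()               # a separate scope with its own `x`
assign("x", 100, envir = other)

x + 1                            # 2: normal evaluation in the current scope
capture_and_run(x + 1, other)    # 101: same expression, different scope
```

The expression `x + 1` is written next to `x <- 1`, yet it ends up reading the `x` that lives in `other`, which is exactly what makes this kind of code hard to reason about from the call site alone.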