Hacker News new | ask | show | jobs
by ryanmonroe 915 days ago
Good article. Some smaller changes you could make to the final function: In the last line `as.data.frame(do.call(cbind, out_list))` is used to convert the list to a data.frame. Passing it to `cbind` converts the list to a matrix (i.e. combines it into one long vector internally), and then `as.data.frame` converts it back to a list (as noted in the article data frames are lists). Instead, you can use `as.data.frame(out_list)` to make your list a data frame directly, to avoid converting the list to a matrix and back to a list again. The `unlist(lapply(split(cvec, groups), aggfun))` is also doing a lot of work, if you don't mind using an external package*, `collapse::BY(cvec, groups, aggfun)` is much faster (doesn't require converting `groups` to factor, doesn't copy `cvec`'s contents like `split`).

Here's some completely not-the-point of the article code review since I can't help myself. If you can set up earlier steps give you a named list for `col_grouping`, and use `lapply`, the code is a little more concise:

    efficient_flow_agg <- function(dat, col_grouping, gpcol_name="GroupMembership") {
      make_postproc <- function(gp, groups) {
        gp$preproc(dat[gp$which_cols]) |>
          lapply(collapse::BY, groups, gp$aggfun) |>
          gp$postproc()
      }
      col_grouping |>
        lapply(make_postproc, groups = dat[[gpcol_name]]) |>
        as.data.frame()
    }
* I had previously written here that `tapply` is probably faster, but apparently `tapply` does exactly `unlist(lapply(split(x, g), f)))` anyway? wtf R. Strange there's not something like `collapse::BY` in base R.
1 comments

Thanks for the feedback! The business with `cbind` is a facepalm, I'll definitely fix that. I don't think it will affect performance much since that last step won't be repeated many times, but it makes me cringe now knowing how redundant it is.

Good advice on `col_grouping` as well, accessing those components of an aggregation rule by index rather than by name is a bad code smell and decreases readability for sure.

Main issue with cbind()ing to matrix then data.frame is conversion of all columns to the same type and potential loss of information