| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ryanmonroe 915 days ago

Good article. Some smaller changes you could make to the final function: In the last line `as.data.frame(do.call(cbind, out_list))` is used to convert the list to a data.frame. Passing it to `cbind` converts the list to a matrix (i.e. combines it into one long vector internally), and then `as.data.frame` converts it back to a list (as noted in the article data frames are lists). Instead, you can use `as.data.frame(out_list)` to make your list a data frame directly, to avoid converting the list to a matrix and back to a list again. The `unlist(lapply(split(cvec, groups), aggfun))` is also doing a lot of work, if you don't mind using an external package*, `collapse::BY(cvec, groups, aggfun)` is much faster (doesn't require converting `groups` to factor, doesn't copy `cvec`'s contents like `split`).

Here's some completely not-the-point of the article code review since I can't help myself. If you can set up earlier steps give you a named list for `col_grouping`, and use `lapply`, the code is a little more concise:

    efficient_flow_agg <- function(dat, col_grouping, gpcol_name="GroupMembership") {
      make_postproc <- function(gp, groups) {
        gp$preproc(dat[gp$which_cols]) |>
          lapply(collapse::BY, groups, gp$aggfun) |>
          gp$postproc()
      }
      col_grouping |>
        lapply(make_postproc, groups = dat[[gpcol_name]]) |>
        as.data.frame()
    }

* I had previously written here that `tapply` is probably faster, but apparently `tapply` does exactly `unlist(lapply(split(x, g), f)))` anyway? wtf R. Strange there's not something like `collapse::BY` in base R.

1 comments

franklin_p_dyer 915 days ago

Thanks for the feedback! The business with `cbind` is a facepalm, I'll definitely fix that. I don't think it will affect performance much since that last step won't be repeated many times, but it makes me cringe now knowing how redundant it is.

Good advice on `col_grouping` as well, accessing those components of an aggregation rule by index rather than by name is a bad code smell and decreases readability for sure.

link

kuhewa 915 days ago

Main issue with cbind()ing to matrix then data.frame is conversion of all columns to the same type and potential loss of information

link