|
|
|
|
|
by ryanmonroe
915 days ago
|
|
Good article. Some smaller changes you could make to the final function: In the last line `as.data.frame(do.call(cbind, out_list))` is used to convert the list to a data.frame. Passing it to `cbind` converts the list to a matrix (i.e. combines it into one long vector internally), and then `as.data.frame` converts it back to a list (as noted in the article data frames are lists). Instead, you can use `as.data.frame(out_list)` to make your list a data frame directly, to avoid converting the list to a matrix and back to a list again. The `unlist(lapply(split(cvec, groups), aggfun))` is also doing a lot of work, if you don't mind using an external package*, `collapse::BY(cvec, groups, aggfun)` is much faster (doesn't require converting `groups` to factor, doesn't copy `cvec`'s contents like `split`). Here's some completely not-the-point of the article code review since I can't help myself. If you can set up earlier steps give you a named list for `col_grouping`, and use `lapply`, the code is a little more concise: efficient_flow_agg <- function(dat, col_grouping, gpcol_name="GroupMembership") {
make_postproc <- function(gp, groups) {
gp$preproc(dat[gp$which_cols]) |>
lapply(collapse::BY, groups, gp$aggfun) |>
gp$postproc()
}
col_grouping |>
lapply(make_postproc, groups = dat[[gpcol_name]]) |>
as.data.frame()
}
* I had previously written here that `tapply` is probably faster, but apparently `tapply` does exactly `unlist(lapply(split(x, g), f)))` anyway? wtf R. Strange there's not something like `collapse::BY` in base R. |
|
Good advice on `col_grouping` as well, accessing those components of an aggregation rule by index rather than by name is a bad code smell and decreases readability for sure.