|
The syntax isn't self-describing and uses lots of abbreviations; it relies on some R magic that I found confusing when learning (unquoted column names and special built-in variables); and data.table simply takes a different approach from SQL and other dataframe libraries. Here's an example from the docs:
flights[carrier == "AA",
        lapply(.SD, mean),
        by = .(origin, dest, month),
        .SDcols = c("arr_delay", "dep_delay")]
That's clearly less clear than the SQL:
SELECT
    origin, dest, month,
    AVG(arr_delay), AVG(dep_delay)
FROM flights
WHERE carrier = 'AA'
GROUP BY origin, dest, month
or pandas:
flights[flights.carrier == 'AA'].groupby(['origin', 'dest', 'month'])[['arr_delay', 'dep_delay']].mean()
But once you get used to it, data.table makes a lot of sense: every operation can be broken down into filtering/selecting, aggregating/transforming, and grouping/windowing. Taking the first two rows per group is a mess in SQL or pandas, but is super simple in data.table:
flights[, head(.SD, 2), by = month]
That data.table has significantly better performance than any other dataframe library in any language is a nice bonus! |
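For anyone who wants to judge the "first two rows per group" comparison themselves, here is a minimal sketch of one common way to write it in SQL, run through Python's built-in sqlite3 module. The table contents are made-up toy rows, not the real flights dataset, and the approach (a ROW_NUMBER() window function) is just one option; it assumes the bundled SQLite is 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical toy data (month, carrier, dep_delay); not the real flights dataset.
rows = [
    (1, "AA", 5.0),
    (1, "UA", 12.0),
    (1, "DL", -3.0),
    (2, "AA", 0.0),
    (2, "UA", 7.0),
    (2, "DL", 1.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (month INTEGER, carrier TEXT, dep_delay REAL)")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", rows)

# Number the rows within each month, then keep only the first two.
# Ordering by rowid makes "first" mean insertion order, roughly like head(.SD, 2).
query = """
SELECT month, carrier, dep_delay
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY month ORDER BY rowid) AS rn
    FROM flights
)
WHERE rn <= 2
ORDER BY month, rn
"""
first_two = conn.execute(query).fetchall()
print(first_two)
```

The subquery is needed because a window function result cannot be filtered in the same SELECT's WHERE clause, which is most of the extra ceremony compared with the one-line data.table version.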
flights.groupby("month").head(2)
Not only does this have all the same keywords, but it is organized in a way that is much clearer to newcomers and labels things you can look up in the API. Your R code, by contrast, has a leading comma, .SD, and a mix of quoted and unquoted references to columns. You even admit the last was confusing to learn. This can all be crammed into your head, but it's not what I would call thoughtfully designed.