Answering questions in a rapid, interactive way (while using C to be efficient enough that one can run it on millions of rows):
# Given a dataset that looks like this…
> head(dt, 3)
    mpg cyl disp  hp drat    wt  qsec vs am gear carb          name
1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     Mazda RX4
2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 Mazda RX4 Wag
3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    Datsun 710
# What's the mean hp and wt by number of carburettors?
> dt[, list(mean(hp), mean(wt)), by=carb]
   carb    V1     V2
1:    4 187.0 3.8974
2:    1  86.0 2.4900
3:    2 117.2 2.8628
4:    3 180.0 3.8600
5:    6 175.0 2.7700
6:    8 335.0 3.5700
# How many Mercs are there and what's their median hp?
> dt[grepl('Merc', name), list(.N, median(hp))]
   N  V2
1: 7 123
# Non-Mercs?
> dt[!grepl('Merc', name), list(.N, median(hp))]
    N  V2
1: 25 113
# N observations and avg hp and wt per {num. cylinders and num. carburettors}
> dcast(dt, cyl + carb ~ ., value.var=c("hp", "wt"), fun.aggregate=list(mean, length))
   cyl carb hp_mean  wt_mean hp_length wt_length
1:   4    1    77.4 2.151000         5         5
2:   4    2    87.0 2.398000         6         6
3:   6    1   107.5 3.337500         2         2
4:   6    4   116.5 3.093750         4         4
5:   6    6   175.0 2.770000         1         1
6:   8    2   162.5 3.560000         4         4
7:   8    3   180.0 3.860000         3         3
8:   8    4   234.0 4.433167         6         6
9:   8    8   335.0 3.570000         1         1
I used slightly verbose syntax so that it is (hopefully) clear even to non-R users. You can see that the interactivity is great at helping you compose answers step by step, molding the data as you go, especially when you combine it with tools like plot.ly to also visualize the results.
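For anyone who wants to try this, here is a minimal sketch of how the session might be reproduced. It assumes dt was built from R's built-in mtcars dataset with the row names kept as a name column (the rows shown match mtcars, but the construction step is my guess, and the column order will differ slightly). The names by_carb, by_merc, mean_hp, mean_wt, n, median_hp and is_merc are just illustrative. It also shows the terser .() form with named result columns instead of V1/V2.

```r
library(data.table)

# Assumption: `dt` is the built-in mtcars data with row names kept as `name`.
dt <- as.data.table(mtcars, keep.rownames = "name")

# Same per-carburettor means as above, but with named result columns.
# `.()` is data.table's shorthand for `list()`.
by_carb <- dt[, .(mean_hp = mean(hp), mean_wt = mean(wt)), by = carb]

# Mercs and non-Mercs in a single query, grouping on the logical test itself.
by_merc <- dt[, .(n = .N, median_hp = median(hp)),
              by = .(is_merc = grepl("Merc", name))]

print(by_carb)
print(by_merc)
```

Grouping on the grepl() result is one way to get both the Merc and non-Merc answers without running the query twice.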
I still think R has an atrocious design as a programming language (although it also has its beautiful side - like when you discover that literally everything in the language is a function call, even all the control structures and function definitions!). A language can be optimized for this sort of interactive analysis while still having a more regular syntax and fewer gotchas. The problem is that in its niche, R is already "good enough", and it is entrenched through libraries and existing code - so a contender can't just be better, it has to be much better.
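To make the "everything is a function call" point concrete, here is a small base-R sketch (no packages needed). The variable names are only illustrative. Backquoting an operator or keyword gives you the underlying function, which can then be called or inspected like any other.

```r
# Operators and indexing are functions that can be called by name.
three  <- `+`(1, 2)               # same as 1 + 2
second <- `[`(c(10, 20, 30), 2)   # same as c(10, 20, 30)[2]

# Control structures too: `if` takes a condition, a "yes" and a "no" branch.
sign_label <- `if`(-5 < 0, "negative", "non-negative")

# Assignment, braces, loops and even function definition are functions as well.
builtins <- list(`if`, `for`, `while`, `{`, `(`, `<-`, `function`)
all_functions <- all(vapply(builtins, is.function, logical(1)))

# As a result, unevaluated code is just a nested call that can be inspected.
expr <- quote(if (x > 0) "pos" else "neg")
head_of_call <- as.character(expr[[1]])   # name of the function being called

print(sign_label)
print(all_functions)
print(head_of_call)
```

This is also what lets packages like data.table reinterpret the expressions you pass inside dt[...] rather than evaluating them the usual way.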