|
|
|
|
|
by javierluraschi
2611 days ago
|
|
In my experience, R is really fast since I t was designed to store data in columnar format which we now all know is best for data analysis. So, in most cases, scaling up computation is quite easy. To scale out, you can use Apache Spark with R, the interface I’ve worked on, sparklyr is quite easy to use and allows you to scale out computation. Just to give you an example of what’s possible, I was playing around yesterday with a ray tracing prototype someone is building and scaled it out in Spark, see https://twitter.com/javierluraschi/status/112055769372135424... — it’s a misconception that R is slow or can’t scale. |
|
Column stores are standard in any analytics pipeline today. They make up Python's Pandas, R's dplyr, and Java's DataFrame. How or why does R stand out for 'massive amounts of data'?
R does not have have meaningful out of core compute offerings that compare with something like Dask.
R does not at all have cluster compute offerings that compare to Dask Distributed.
If you want to know what real performance looks like, check out Python's cudf which will shortly fully match the Pandas api. That raytracing example you linked would run at interactive rates with cudf, I really don't see any basis for perf arguments in R's favour, and 'massive data' arguments are laughable here.
Whatever advantages R has, perf or scalability are definitely not amongst them.