by mindcrime 3406 days ago
For large datasets, all the momentum seems to be moving towards Spark (sparklyr is RStudio's Spark integration for R).

Worst case, you can always use MPI with R and run on a Beowulf cluster. Of course that might not help if you want to use a function from a library, and the library itself expects everything to be in memory on one node, but at least it gives you another option for parallelization.

1 comments

Absolutely, though as you mention, giving up the ability to use existing packages, plus having to write statistical code that properly accounts for data being spread across multiple nodes, would likely put this out of reach of the typical R user. An open source alternative to the out-of-core processing backend and distributed analytics packages in Revolution R/Microsoft R Server would be a huge addition to the R language.
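To make the "accounts for data being spread out" part concrete, here is a rough sketch of what even a simple statistic looks like once the data is chunked. It is written in Python rather than R purely for illustration, and the names `Partial`, `summarize`, and `merge` are made up for this example, not from any package. Each node summarizes its own chunk in one pass, and the partial results are then merged with the standard pairwise update for the mean and the sum of squared deviations:

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable


@dataclass
class Partial:
    """Per-chunk summary: count, mean, and sum of squared deviations."""
    n: int
    mean: float
    m2: float


def summarize(chunk: Iterable[float]) -> Partial:
    """Single pass over one chunk (Welford's update); no full copy in memory."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Partial(n, mean, m2)


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two summaries without revisiting the underlying data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Partial(n, mean, m2)


def variance(p: Partial) -> float:
    """Sample variance (n - 1 denominator) from a merged summary."""
    return p.m2 / (p.n - 1)


# Hypothetical data: three chunks standing in for three nodes.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
total = reduce(merge, (summarize(c) for c in chunks))
print(total.n, total.mean, variance(total))
```

That is just the mean and variance. Anything like a regression or a mixed model needs its own decomposition into mergeable pieces, which is presumably why a ready-made distributed analytics package matters so much.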