by kprybol 3398 days ago
Julia's biggest hurdle is the lack of well-functioning DataFrames (or the current fork, DataTables). Tons of issues around nullable arrays, etc. have really slowed progress. I do think it's got a ton of upside, but I've found reimplementing my R or Python scripts in Julia to be too much of a hassle. The costs of reimplementation greatly outweigh the not insignificant gains in speed.

Also check out this article on updates to R 3.4. R tends to be fast enough for most work (I use it regularly for one-off analyses or things that won't ever make it further than ad hoc reporting/findings, but I can't imagine using it in production systems). The listed changes should go a long way towards making R just fast(er) enough for dealing with larger datasets (they don't help with datasets larger than memory, though). For large datasets all the momentum seems to be moving towards Spark (sparklyr is RStudio's Spark integration for R; very much a beta, but getting better by the day). On the Python front, Dask is awesome for out-of-memory (larger-than-RAM) computation and has no equivalent in R.
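To make the "larger than memory" point concrete, here is a minimal sketch of the chunked aggregation idea that out-of-core tools build on. It is not Dask's API, just standard-library Python; the function names, the column names `group` and `x`, and the tiny sample data are all made up for illustration.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from an iterator."""
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def grouped_mean(csv_file, key, value, chunk_size=10_000):
    """Per-group mean of `value`, holding only one chunk of rows at a time."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    reader = csv.DictReader(csv_file)
    for chunk in iter_chunks(reader, chunk_size):
        # Only running totals survive between chunks, never the raw rows.
        for row in chunk:
            sums[row[key]] += float(row[value])
            counts[row[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Tiny in-memory stand-in for a file too large to load at once (hypothetical data).
sample = io.StringIO("group,x\na,1\nb,10\na,3\nb,20\na,5\n")
result = grouped_mean(sample, key="group", value="x", chunk_size=2)
print(result)
```

This only works so directly because a mean decomposes into sums and counts; anything that needs all rows at once (a median, say) takes more care, which is where a dedicated library earns its keep.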

3 comments

> For large datasets all the momentum seems to be moving towards Spark

Worst case, you can always use MPI with R and run on a Beowulf cluster. Of course that might not help if you want to use a function from a library, and the library itself expects everything to be in memory on one node, but at least it gives you another option for parallelization.

Absolutely, though as you mention, losing the ability to use packages and having to write statistical code that properly accounts for data being spread across multiple nodes would likely put this out of reach of the everyday/typical R user. An open-source alternative to Revolution R/Microsoft R Server's out-of-core processing backend and distributed analytics packages would be a huge addition to the R language.
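As a rough illustration of what "statistical code that accounts for data being spread across nodes" involves, here is a sketch that computes a variance from per-partition summaries instead of from the pooled data. It uses the standard pairwise update for counts, means, and sums of squared deviations (often attributed to Chan et al.). The partitions are plain Python lists standing in for data on separate nodes, no MPI is involved, and each partition is assumed to be non-empty.

```python
from functools import reduce


def summarize(values):
    """Summary a single node could compute locally: (count, mean, M2)."""
    n = len(values)  # assumed non-empty
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)  # sum of squared deviations
    return n, mean, m2


def merge(a, b):
    """Combine two partial summaries without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


# Hypothetical data split unevenly across three "nodes".
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
n, mean, m2 = reduce(merge, map(summarize, partitions))
sample_variance = m2 / (n - 1)
print(n, mean, sample_variance)
```

Only three numbers per node cross the network here. Naively averaging the per-node variances would give the wrong answer whenever the node means differ, which is the kind of mistake that makes this hard for a typical user.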
Realized I never posted the link about R 3.4 that I referenced. https://cdn.ampproject.org/c/s/www.r-bloggers.com/performanc...
How have I not heard of Dask? It would have saved me a lot of pain when trying to deal with OOM (out-of-memory) datasets in my Luigi pipeline.