Hacker News
by Simorgh 3862 days ago
I agree with your assertion that R is slow, yet quick to develop in.

I recently had to loop through 1.3 GB of data (5000 files) and merge just one column from each file into a new dataset. It took ~2 hours, yet the loop was just ~5 lines of code.

2 comments

This task sounds almost uniquely poorly suited for R, but this has gotten better. For example, adding a column (did you append to the right or do an actual merge/join?) used to require copying the previous table but doesn't any more.

I wonder if you tried doing things like:

* preallocate a list, then do.call(cbind, your_data)
* Same as above, but with some of the faster alternatives to cbind like dplyr::bind_cols or data.table::cbind
* Use data.table, which has far faster joins than base R (so does dplyr) if you were doing a true merge/join
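
A minimal base R sketch of the first suggestion, in case it helps. The file layout, the column name `value`, and the temp directory are all made up for illustration; the original files may look nothing like this.

```r
# Hypothetical setup: a few small CSV files, each with an `id` and a `value` column
dir <- tempfile("cols_")
dir.create(dir)
for (i in 1:3) {
  write.csv(data.frame(id = 1:4, value = i * (1:4)),
            file.path(dir, sprintf("file%d.csv", i)),
            row.names = FALSE)
}

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Preallocate a list with one slot per file instead of growing a table in the loop
cols <- vector("list", length(files))
names(cols) <- basename(files)

for (i in seq_along(files)) {
  # Keep only the one column we care about from each file
  cols[[i]] <- read.csv(files[i])[["value"]]
}

# Bind all columns in a single call at the end
combined <- do.call(cbind, cols)
```

The point is that the expensive combine step happens once, after the loop, rather than once per file.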

If it was truly just adding a column from each file together into one file, these kinds of tasks are much better handled with UNIX tools, in my experience.

It is slow. And it is ok. Very few times will R ever beat any other language. Usually it is not off by much, but code written by a novice using for loops instead of apply functions can be 100-1000x slower.
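
To make the loop vs apply point concrete, here is a rough sketch of the same computation written four ways. The task (square roots of 1..n) is just a stand-in. As far as I know the big slowdowns usually come from growing an object inside the loop rather than from the for loop itself, so treat the first pattern as the one to avoid.

```r
n <- 1e4
x <- seq_len(n)

# Typical novice pattern: grow the result with c() on every iteration,
# which copies the whole vector each time
grown <- c()
for (i in x) {
  grown <- c(grown, sqrt(i))
}

# Same loop, but writing into a preallocated vector
prealloc <- numeric(n)
for (i in x) {
  prealloc[i] <- sqrt(i)
}

# apply-family version; numeric(1) declares the expected result type per element
applied <- vapply(x, sqrt, numeric(1))

# Fully vectorised version
vectorised <- sqrt(x)
```

All four give the same numbers; wrapping each in system.time() on your own machine will show how far apart they are.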

Another example is the immutable structure that causes R to be a memory hog, creating copies of data everywhere. But again, if you plan well and execute the 'best' solutions, you can avoid the giant pitfalls, though you will rarely beat an equally well-written Python equivalent.

Post R 3.1 there are far fewer deep copies (e.g. modifying a list or adding a column to a data.frame no longer copies the whole thing like it used to).