| HN Mirror



	by migiale 5397 days ago
	Unfortunately, it's almost impossible to work with very large datasets in R because of its speed limitations. Many researchers I know use Matlab for this reason.

4 comments

hvs 5397 days ago

What about Octave? Other than my use in the Stanford Machine Learning class, I've never really used either, so I don't have any basis for comparison.


rflrob 5397 days ago

My recollection is that Octave is significantly slower than Matlab, and some quick googling on benchmarks [1] suggests that it is (was?) as slow or slower than R.

I've complained before that Octave is the wrong solution to the Matlab problem, and if you aren't attached to one of the many fine Matlab toolkits, you're likely better served translating to a more expressive language, like Python+Numpy+Scipy.

[1] http://sciviews.org/benchmark/


migiale 5397 days ago

Octave is a Matlab clone; in fact, Octave developers openly say that, except for some special cases, any difference between Octave and Matlab is a bug.

The biggest difference between Matlab and Octave is the JIT compiler in Matlab, which does an incredibly good job of vectorizing simple (and sometimes even not-so-simple) loops.

I think it's fair to say that Octave's performance is very close to that of pre-JIT Matlab.

There are also huge differences in toolboxes, profiling, sparse matrix operations, parallel computing, and many, many more areas. In these, I'm afraid Octave is light-years behind Matlab.

However, you can still do a lot of useful simple stuff with Octave, and it's free! Matlab-like syntax is really, really cool when it comes to vectorized operations. These two reasons probably determined Andrew Ng's choice of Octave as the main environment for ml-class. Huge win for Octave, I guess. This might spur some interest in its development and attract new people to the project. I think it's a well-deserved success for John W Eaton and the other people who develop(ed) Octave all these years.


mturmon 5397 days ago

I agree with your take on Octave performance relative to Matlab. The Matlab parallel toolbox is getting more and more useful in a multicore world.

As you note, the Matlab profiler is very nice. You can zero in on the 80% of the 80/20 tradeoff very fast, during your usual development cycle. It's as simple as:

    >> profile on
    >> do_something
    >> profile report

and you get a nice graphical/textual report on time usage in everything do_something called.
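For anyone following the Python suggestion made elsewhere in this thread, a rough analogue of the same on/run/report workflow can be sketched with the standard library's cProfile and pstats modules. The functions `do_something` and `slow_helper` below are placeholders invented for illustration, and the report is text only, not the graphical view Matlab gives you.

```python
import cProfile
import io
import pstats


def slow_helper(n):
    # Placeholder workload: sum of squares below n.
    return sum(i * i for i in range(n))


def do_something():
    # Placeholder for the code you actually want to profile.
    return [slow_helper(20_000) for _ in range(20)]


# Roughly "profile on": start collecting timing data.
profiler = cProfile.Profile()
profiler.enable()

do_something()

# Stop collecting, then build a text report sorted by cumulative time.
profiler.disable()
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)

report = buffer.getvalue()
print(report)
```

The report lists each function reached from `do_something` with its call count and cumulative time, which is usually enough to spot the hot path.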


muuh-gnu 5397 days ago

> in fact Octave developers openly say that

This is not true. They strive for Matlab language compatibility, but none of them refers to Octave as a "Matlab clone", nor are they working on cloning Matlab, nor was the project started to become a Matlab clone. It is like calling Linux a "Unix clone".


tonyt 5396 days ago

It can't be that bad, Oracle are shipping it in their new Big Data Appliance.

http://radar.oreilly.com/2011/10/oracles-big-data-appliance....

It's probably more an issue of easily pre-filtering/aggregating the data before analysing it with R. I like this approach of moving the calculation to the data, but we must be very late on the adoption curve if Oracle are doing it already.


carbocation 5397 days ago

For statistical genetics at least, it's common to process much of the data in parallel, so the RAM limitations on one R instance are not the gating factor.
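The general shape of that approach can be sketched as follows: summarize each chunk independently, then merge the small summaries, so no single process ever holds the whole dataset. This is only an illustration in Python, not any particular genetics pipeline. The `Summary` class, the chunk size, and the toy data are all assumptions made for the example, and the chunks are processed in a plain loop here, although each call to `summarize` could just as well run in a separate process or cluster job.

```python
class Summary:
    """Running count, sum, and sum of squares for one chunk of values."""

    def __init__(self, n=0, total=0.0, total_sq=0.0):
        self.n = n
        self.total = total
        self.total_sq = total_sq

    def merge(self, other):
        # Summaries combine by simple addition, so chunks can be
        # processed in any order and on any machine.
        return Summary(
            self.n + other.n,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        # Population variance computed from the merged sums.
        return self.total_sq / self.n - self.mean ** 2


def summarize(chunk):
    """Reduce one chunk to a small Summary; only the chunk is in memory."""
    s = Summary()
    for x in chunk:
        s.n += 1
        s.total += x
        s.total_sq += x * x
    return s


def chunks(values, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


# Toy data standing in for a dataset too large for one R (or Python) instance.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

partials = [summarize(chunk) for chunk in chunks(data, 3)]

overall = Summary()
for partial in partials:
    overall = overall.merge(partial)

print(overall.n, overall.mean, overall.variance)
```

The trick only works for statistics that can be rebuilt from per-chunk summaries (counts, sums, and similar); anything that needs all rows at once still runs into the memory limit.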


eastwest 5396 days ago

Having seen and heard about what Bioconductor had to do to process genetic data, memory is a huge issue. It is even more so with next-generation sequencing data.


carbocation 5396 days ago

Yes, I guess I've always operated under the assumption that I need to parallelize heavily. I usually work with next-gen sequencing data from families of ~40 people, and the tools that I use generally finish within about an hour.


xtracto 5397 days ago

I use R every day for my research (doing social simulations sometimes based on sample surveys). An additional R limitation is the memory limit. R cannot use virtual memory and the maximum amount of data is limited.

There are two ways to deal with that. One is to load datasets through a SQL database (using a SQL library), which IMHO is a "dirty hack". The other (what I usually do) is to load the huge datasets in Stata (or any other stats package) and filter the data down to a set that is small enough to work with in R.
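Both workarounds come down to the same idea: filter or aggregate where the data lives and pull only a small result into memory. A minimal sketch of the SQL variant, written in Python with the standard sqlite3 module rather than R, might look like the following. The `survey` table, its columns, and the helper function names are hypothetical, and an in-memory database stands in for a real one.

```python
import sqlite3


def load_filtered(conn, min_age, region):
    """Return only the rows needed for analysis; the filtering runs in SQL."""
    query = (
        "SELECT respondent_id, age, weight FROM survey "
        "WHERE age >= ? AND region = ? ORDER BY respondent_id"
    )
    return conn.execute(query, (min_age, region)).fetchall()


def weighted_mean_age(conn, region):
    """Aggregate inside the database so only one number reaches the client."""
    row = conn.execute(
        "SELECT SUM(age * weight) / SUM(weight) FROM survey WHERE region = ?",
        (region,),
    ).fetchone()
    return row[0]


# Hypothetical toy table standing in for a dataset too large to load whole.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE survey "
    "(respondent_id INTEGER, age INTEGER, region TEXT, weight REAL)"
)
conn.executemany(
    "INSERT INTO survey VALUES (?, ?, ?, ?)",
    [
        (1, 34, "north", 1.0),
        (2, 51, "south", 2.0),
        (3, 29, "north", 3.0),
        (4, 62, "north", 1.0),
    ],
)

subset = load_filtered(conn, 30, "north")
print(subset)
print(weighted_mean_age(conn, "north"))
```

Whether this counts as a hack is a matter of taste, but the memory footprint on the analysis side is set by the size of the query result, not by the size of the table.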

Other than that, the available libraries in R are crazy good. For example, stuff like Approximate Bayesian Computation or survey analysis (taking weight factors into account) is straightforward with the available libraries.
