by apechai 5134 days ago
Anyone have experience as to how Python compares to R, Matlab (Octave) and other tools for data analysis?

R has great libraries but I would prefer to use Python.

3 comments

I've done a lot of work using both R and Python, and a little bit using Matlab (and SAS and Stata).

Matlab is behind R and Python for both data cleaning and analysis. This is especially true if you have string variables and factor variables. Matlab's great if you are doing matrix operations on clean data (and you don't need to do anything fancy in how you report them). But I don't find it worth using for real data analysis.

Python and R are both great, though they feel pretty similar to me. Python obviously has the big advantage if you want to do general purpose computing too. Python has been faster than R in most cases where I've compared them. For my work, development speed is more important than execution speed, so this hasn't been a huge factor for me.

R has a couple of big advantages. First, the libraries. This is a big deal for me. I saw that there is a python equivalent to ggplot2 in the works. This will definitely strengthen the case for python, but the availability of libraries in R is awesome.

Second, the community and help resources in R are amazing. I rarely run into problems in R that haven't already been addressed on stackexchange.com.

Perhaps I should be more proactive about asking python questions when I run into them, but I usually just work it out myself (which is more time consuming than looking the answer up online.)

Lastly, I'm not an expert on big data. But, spending relatively little time with both R's bigmemory and Python's PyTables, it seems easier to get up to speed on big data with R at the moment.

Though I haven't met them, my sense is that Wes, Travis Oliphant and the other relevant python developers are putting in a heroic effort to get Python up to speed. I have every expectation that Python will be my choice of the future.

Note: This reply is mostly helpful if you work with legacy Matlab code, or have colleagues who primarily know Matlab, and you have to work with string data.

In general I agree with you; Matlab's age and origins show through in some warty ways, and one of them is string processing. Whenever I have to process anything that's not simple CSV or Excel, I use Python. (For XML, there's a Perl XPath command-line tool, which has come in pretty handy for simple XML extraction.)

That said, however, the Statistics Toolbox has classes dataset, nominal, and ordinal that take a huge amount of the pain out of working with string data. Dataset lets you mix column types and refer to them by name, and lets you name rows if you like. I think it's similar to a dataframe in R. Nominal and ordinal are efficient representations for string columns. They are a workaround for Matlab's lack of a runtime string pool, but also are fast and small.
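To make the "efficient representation for string columns" idea concrete, here is a rough Python sketch of the general technique: store each distinct label once and keep the column as small integer codes. This is not how the Statistics Toolbox implements nominal or ordinal; `encode_nominal` and `decode_nominal` are hypothetical names made up for illustration.

```python
def encode_nominal(values):
    """Return (levels, codes): each distinct label stored once, rows stored as ints."""
    levels = sorted(set(values))                       # one copy of each distinct string
    index = {label: i for i, label in enumerate(levels)}
    codes = [index[v] for v in values]                 # the column as integer codes
    return levels, codes


def decode_nominal(levels, codes):
    """Rebuild the original string column from levels and codes."""
    return [levels[c] for c in codes]


colors = ["red", "blue", "red", "green", "blue", "red"]
levels, codes = encode_nominal(colors)
```

Comparisons and grouping can then work on the integer codes instead of the strings, which is roughly where the "fast and small" part comes from.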

This is a really detailed comparison and seems balanced. Thanks for that. Do you not use both in conjunction (R for the stats libraries and Python for the more general computing)? Having done some background reading (but without yet getting down into the weeds with them), that was my take on the relationship.

Edit: re-reading your last paragraph, I guess you're saying that Python can reach parity with R's libraries, at which point its elegance and speed will win out. R's decades of lineage do seem to hobble it in terms of style and syntax, after all.

Even though I hear complaints about R's syntax, I don't know exactly what people dislike about it. In fact, I kind of like R's syntax.

As an example, I like the ability to use expressions on the left side of an assignment (e.g. names(df) = "stuff"). But it sounds like you are right that the python developers are getting to learn from R's mistakes and avoid getting locked into legacy ideas.

As far as libraries... R has a lot. So I don't expect python to totally catch up soon. But I only use 10 or 15 R libraries, and those are really popular libraries. So, unless you do an incredible range of stuff, python probably doesn't have to completely catch up.

One major advantage for R is the package management system (CRAN). The uniformity of the interface... the ability to search for stuff in it... that's been really useful for me. Not sure if anything like that is in the works for python.

Lastly, there are a lot of little helper functions that I've written for myself in python that are part of base R. The first example that comes to mind is head() to view the top few lines of a data structure. It seems strange that python would be missing these little helper functions, but I never found them.
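For what it's worth, a helper like that can be only a few lines. A minimal sketch, assuming plain Python iterables rather than any particular data-frame library; the name `head` and the default of 6 items are borrowed from R, and the dict handling is just one possible choice:

```python
from itertools import islice


def head(data, n=6):
    """Return the first n items of an iterable as a list (R's head() defaults to 6)."""
    if isinstance(data, dict):
        # For dicts, show (key, value) pairs rather than just the keys.
        return list(islice(data.items(), n))
    return list(islice(data, n))


first_rows = head(range(100))
first_letters = head("abcdefgh", 3)
```

Because it uses islice, it also works on generators and open files without reading the whole thing into memory.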

I have been programming python too long to make an objective comparison with R. I have had to use R libraries at times, and I've found rpy to be a workable bridge from R to python for this purpose. Depending on how it works under the hood, it might not be appropriate for big data, though. Also, I had to custom modify some R libraries to work with my data, so it has been useful to know a bit of both, although I mostly picked up the R as I did the mods.
I would add that an advantage of Matlab for number bashing is a much more native handling of linear algebra. Let's say you have two matrices A and B. In Matlab I could write:

  A*B*A'
Whereas in python it would be (approximately):

  dot(A, dot(B, A.T))
so matlab can evaluate everything in the right order (right to left) whereas for numpy I end up writing a little recursive function to dot a list of arguments together from right to left, which feels a bit kludgey and more of an impediment to getting your ideas down in code. Especially when your equations get very big, as they often do with stats!
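A helper of the kind described here might look roughly like the sketch below. It uses functools.reduce instead of explicit recursion, and a pure-Python `matmul` stands in for numpy.dot so the example runs without NumPy. Both `matmul` and `chain_dot` are hypothetical names, not part of any library.

```python
from functools import reduce


def matmul(a, b):
    """Multiply two matrices stored as lists of rows (stand-in for numpy.dot)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def chain_dot(*mats, dot=matmul):
    """Multiply any number of matrices, grouping from the right:
    chain_dot(A, B, C) computes dot(A, dot(B, C))."""
    # Start from the last matrix and fold the others in from right to left.
    return reduce(lambda acc, m: dot(m, acc), reversed(mats))


A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 2, 3], [4, 5, 6]]
C = [[1], [2], [3]]
result = chain_dot(A, B, C)
```

With NumPy installed, passing dot=numpy.dot should give the same grouping on arrays.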
If you intend "matrix multiplication", there is no "right order": "matrix multiplication" is associative (http://en.wikipedia.org/wiki/Matrix_multiplication).

Numpy has a "matrix" type, so you can write:

        In [7]: A = numpy.matrix('1 2; 3 4; 5 6')
        In [8]: B = numpy.matrix('1 2 3; 4 5 6')
        In [9]: C = numpy.matrix('1; 2; 3')

        In [10]: A * B * C
        Out[10]:
        matrix([[ 78],
                [170],
                [262]])

        In [11]: (A * B) * C
        Out[11]:
        matrix([[ 78],
                [170],
                [262]])

        In [12]: A * (B * C)
        Out[12]:
        matrix([[ 78],
                [170],
                [262]])
It isn't native, but numexpr is a nice in-between:

http://code.google.com/p/numexpr/

"A[n]yone have experience as to how Python compare to R, Matlab (Octave)"

Amusing coincidence: this article appeared on Slashdot this Wednesday:

Comparing R, Octave, and Python for Data Analysis:

http://developers.slashdot.org/story/12/05/23/1956219/compar...

You might want to look into RPy for using R libraries with python.
I've very briefly experimented with rpy2. It got the job done, but I thought it was tricky enough that I'd want a good reason to combine R and python. Otherwise, I'd try and do the whole project in one or the other (And I haven't used rpy2 since I first tried it out.)