| HN Mirror


by dbecker 5181 days ago

I've done a lot of work using both R and Python, and a little bit using Matlab (and SAS and Stata).

Matlab is behind R and Python for both data cleaning and analysis. This is especially true if you have string variables and factor variables. Matlab's great if you are doing matrix operations on clean data (and you don't need to do anything fancy in how you report them). But I don't find it worth using for real data analysis.

Python and R are both great, though they feel pretty similar to me. Python obviously has the big advantage if you want to do general-purpose computing too. Python has been faster than R in most cases where I've compared them. For my work, development speed is more important than execution speed, so this hasn't been a huge factor for me.

R has a couple of big advantages. First, the libraries. This is a big deal for me. I saw that there is a python equivalent to ggplot2 in the works. This will definitely strengthen the case for python, but the availability of libraries in R is awesome.

Second, the community and help resources in R are amazing. I rarely run into a problem in R that hasn't already been addressed on stackexchange.com.

Perhaps I should be more proactive about asking python questions when I run into them, but I usually just work it out myself (which is more time consuming than looking the answer up online).

Lastly, I'm not an expert on big data. But, spending relatively little time with both R's bigmemory and Python's PyTables, it seems easier to get up to speed on big data with R at the moment.

Though I haven't met them, my sense is that Wes, Travis Oliphant and the other relevant python developers are putting in a heroic effort to get Python up to speed. I have every expectation that Python will be my choice of the future.

3 comments

drunkpotato 5181 days ago

Note: This reply is mostly helpful if you work with legacy Matlab code, or have colleagues who primarily know Matlab, and you have to work with string data.

In general I agree with you; Matlab's age and origins show through in some warty ways, and one of them is string processing. Whenever I have to process anything that's not simple CSV or Excel, I use Python. (For XML, there's a Perl XPath command-line tool, which has come in pretty handy for simple XML extraction.)

That said, however, the Statistics Toolbox has classes dataset, nominal, and ordinal that take a huge amount of the pain out of working with string data. Dataset lets you mix column types and refer to them by name, and lets you name rows if you like. I think it's similar to a dataframe in R. Nominal and ordinal are efficient representations for string columns. They are a workaround for Matlab's lack of a runtime string pool, but also are fast and small.
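To make the idea behind nominal a bit more concrete, here is a minimal Python sketch of that kind of encoding: each distinct string is stored once, and the column itself holds small integer codes. The name encode_nominal is made up for illustration, and this is only an assumption about the general technique, not how the Statistics Toolbox implements it.

```python
def encode_nominal(values):
    """Encode a list of strings as (levels, codes).

    levels: the distinct strings, in first-seen order
    codes:  one integer per input value, indexing into levels
    Hypothetical helper, for illustration only.
    """
    levels = []   # each distinct string stored exactly once
    index = {}    # string -> position in levels
    codes = []    # compact integer column
    for v in values:
        if v not in index:
            index[v] = len(levels)
            levels.append(v)
        codes.append(index[v])
    return levels, codes


colors = ["red", "blue", "red", "green", "blue"]
levels, codes = encode_nominal(colors)

# Decoding maps each code back to its string
decoded = [levels[c] for c in codes]
```

Comparisons and grouping can then work on the integer codes instead of the strings, which is where the speed and size savings would come from.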


clebio 5181 days ago

This is a really detailed comparison and seems balanced. Thanks for that. Do you not use both in conjunction (R for the stats libraries and Python for the more general computing)? Having done some background reading (but without yet getting down into the weeds with them), that was my take on the relationship.

Edit: re-reading your last paragraph, I guess you're saying that Python can reach parity with R's libraries, at which point its elegance and speed will win out. R's decades of lineage do seem to hobble it in terms of style and syntax, after all.


dbecker 5180 days ago

Even though I hear complaints about R's syntax, I don't know exactly what people dislike about it. In fact, I kind of like R's syntax.

As an example, I like the ability to use expressions on the left side of an assignment (e.g. names(df) = "stuff"). But it sounds like you are right that the python developers are getting to learn from R's mistakes and avoid getting locked into legacy ideas.

As far as libraries... R has a lot. So I don't expect python to totally catch up soon. But I only use 10 or 15 R libraries, and those are really popular libraries. So, unless you do an incredible range of stuff, python probably doesn't have to completely catch up.

One major advantage for R is the package management system (CRAN). The uniformity of the interface... the ability to search for stuff in it... that's been really useful for me. Not sure if anything like that is in the works for python.

Lastly, there are a lot of little helper functions that I've written for myself in python that are part of base R. The first example that comes to mind is head() to view the top few lines of a data structure. It seems strange that python would be missing these little helper functions, but I never found them.
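A helper along these lines takes only a few lines of Python. This is a rough sketch, not the helper mentioned in the comment: the name head and the default of 6 items simply mirror R's head(), and it only assumes the input is iterable.

```python
from itertools import islice


def head(data, n=6):
    """Return the first n items of any iterable as a list.

    The default of 6 mirrors R's head(); illustrative helper only.
    """
    return list(islice(data, n))


first_three = head(range(100), 3)
first_lines = head(["a,b", "1,2", "3,4"], 2)
```

Because it relies on islice, it works on lists, generators and open files alike without reading more than n items.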


Estragon 5181 days ago

I have been programming python too long to make an objective comparison with R. I have had to use R libraries at times, and I've found rpy to be a workable bridge from R to python for this purpose. Depending on how it works under the hood, it might not be appropriate for big data, though. Also, I had to custom modify some R libraries to work with my data, so it has been useful to know a bit of both, although I mostly picked up the R as I did the mods.


ballooney 5181 days ago

I would add that an advantage of Matlab for number bashing is a much more native handling of linear algebra. Let's say you have two matrices A and B. In Matlab I could write:

  A*B*A'

Whereas in python it would be (approximately):

  dot(A, dot(B, A.T))

so Matlab can evaluate everything in the right order (right to left), whereas for numpy I end up writing a little recursive function to dot a list of arguments together from right to left. That feels a bit kludgy and is more of an impediment to getting your ideas down in code, especially when your equations get very big, as they often do with stats!
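For reference, a helper like the one described can be sketched without explicit recursion. This is an assumption about what such a function might look like, not the commenter's actual code; the name chain_dot is hypothetical, and it expects at least one matrix.

```python
from functools import reduce

import numpy as np


def chain_dot(*matrices):
    """Multiply matrices in the given order, folding from the right:
    chain_dot(A, B, C) computes dot(A, dot(B, C)).
    Hypothetical helper; raises IndexError if called with no arguments.
    """
    ms = list(matrices)
    # Start from the last matrix and multiply each earlier one onto its left
    return reduce(lambda acc, m: np.dot(m, acc), reversed(ms[:-1]), ms[-1])


A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

# Equivalent of Matlab's A*B*A' (the trailing ' is a transpose)
result = chain_dot(A, B, A.T)
```

Newer NumPy releases also provide numpy.linalg.multi_dot, which takes a list of arrays and chooses the multiplication order itself.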


lbolla 5181 days ago

If you mean "matrix multiplication", there is no "right order": matrix multiplication is associative (http://en.wikipedia.org/wiki/Matrix_multiplication).

Numpy has a "matrix" type, so you can write:

        In [7]: A = numpy.matrix('1 2; 3 4; 5 6')
        In [8]: B = numpy.matrix('1 2 3; 4 5 6')
        In [9]: C = numpy.matrix('1; 2; 3')

        In [10]: A * B * C
        Out[10]:
        matrix([[ 78],
                [170],
                [262]])

        In [11]: (A * B) * C
        Out[11]:
        matrix([[ 78],
                [170],
                [262]])

        In [12]: A * (B * C)
        Out[12]:
        matrix([[ 78],
                [170],
                [262]])
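On Python 3.5 or later, with a NumPy version that supports it, plain arrays can use the @ operator for the same chained product, so the matrix type is not required for this. A small sketch using the same values as the session above (the variable names left, right and same are just for illustration):

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
B = np.array([[1, 2, 3], [4, 5, 6]])
C = np.array([[1], [2], [3]])

# Both groupings give the same product, since matrix multiplication is associative
left = (A @ B) @ C
right = A @ (B @ C)
same = np.array_equal(left, right)
```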


maxerickson 5181 days ago

It isn't native, but numexpr is a nice in-between:

http://code.google.com/p/numexpr/
