| I've done a lot of work using both R and Python, and a little bit using Matlab (and SAS and Stata). Matlab is behind R and Python for both data cleaning and analysis. This is especially true if you have string variables and factor variables. Matlab's great if you are doing matrix operations on clean data (and you don't need to do anything fancy in how you report them). But I don't find it worth using for real data analysis. Python and R are both great, and they feel pretty similar to me. Python obviously has the big advantage if you want to do general-purpose computing too. Python has been faster than R in most cases where I've compared them. For my work, development speed is more important than execution speed, so this hasn't been a huge factor for me. R has a couple of big advantages. First, the libraries. This is a big deal for me. I saw that there is a Python equivalent to ggplot2 in the works. This will definitely strengthen the case for Python, but the availability of libraries in R is awesome. Second, the community and help resources in R are amazing. I rarely run into a problem in R that hasn't already been addressed on stackexchange.com. Perhaps I should be more proactive about asking Python questions when I run into them, but I usually just work it out myself (which is more time-consuming than looking the answer up online). Lastly, I'm not an expert on big data. But, having spent relatively little time with both R's bigmemory and Python's PyTables, it seems easier to get up to speed on big data with R at the moment. Though I haven't met them, my sense is that Wes, Travis Oliphant and the other relevant Python developers are putting in a heroic effort to get Python up to speed. I have every expectation that Python will be my choice in the future. |
In general I agree with you; Matlab's age and origins show through in some warty ways, and one of them is string processing. Whenever I have to process anything that's not simple CSV or Excel, I use Python. (For XML, there's a Perl XPath command-line tool, which has come in pretty handy for simple XML extraction.)
That said, the Statistics Toolbox has the classes dataset, nominal, and ordinal, which take a huge amount of the pain out of working with string data. Dataset lets you mix column types and refer to them by name, and lets you name rows if you like. I think it's similar to a data frame in R. Nominal and ordinal are efficient representations for string columns. They are a workaround for Matlab's lack of a runtime string pool, but they are also fast and small.
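
To make the "efficient representation for string columns" idea concrete, here is a rough sketch in plain Python of how a nominal or ordinal column can work: each distinct label is stored once, and every row keeps only a small integer code. This is only an illustration of the general technique, not how the Statistics Toolbox actually implements it, and the names `encode_nominal` and `decode` are made up for this example.

```python
from array import array


def encode_nominal(values):
    """Store each distinct string once and keep one small integer code per row.

    Hypothetical helper for illustration only.
    """
    levels = []         # distinct labels, in order of first appearance
    index = {}          # label -> integer code
    codes = array("H")  # unsigned short codes; assumes fewer than 65536 levels
    for value in values:
        if value not in index:
            index[value] = len(levels)
            levels.append(value)
        codes.append(index[value])
    return levels, codes


def decode(levels, codes):
    """Rebuild the original string column from the level table and the codes."""
    return [levels[code] for code in codes]


column = ["low", "high", "medium", "low", "low", "high"]
levels, codes = encode_nominal(column)

# An ordinal column additionally fixes the order of the levels,
# so comparisons can be done on the integer ranks.
order = ["low", "medium", "high"]
rank = {label: i for i, label in enumerate(order)}
ordinal_codes = [rank[value] for value in column]
at_least_medium = [code >= rank["medium"] for code in ordinal_codes]
```

With this layout a long column of repeated labels costs a couple of bytes per row plus one copy of each label, which is roughly why such representations end up both small and fast to compare.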