by B1gred 3861 days ago
Python has a steeper learning curve and is not as tailored to simple data analysis. Many use RStudio (an IDE) and its data import and other tools, which lower the skill barrier to entry even further.

Also, mathematicians and statisticians tend to think functionally. The general attitude in Python is to do object-oriented programming, while R is primarily functional programming with a little bit of object-oriented programming.


I'm a little at odds on this--for production quality analyses, (and only for analyses) R is excellent.

However, in my experience, for the data munging required as a preliminary to the analyses, R is worse than bad. It's as if Satan himself designed a language.

I find that what then happens is this: data scientists/statisticians/[your favorite word here] become reliant on programmers to clean/format the data to do the analyses.

This is all fine, but those same scientists are then put off learning Python, where they could do all of their own munging, and probably 95% of the analysis they need to do, and where they could further add value by writing programs that are easier to productionize.

Job security for those who know how to write production code, I guess.

Is your assessment based on using recent R packages? I recently learned about dplyr, magrittr and rvest in a couple of recent data science courses and it seems to me that data munging is a pleasure with R. For example, I had a rough time scraping Wikipedia using Python/BeautifulSoup (I might be a little weak using them tbh) but knocked it out with rvest and magrittr. I never wrote it up but this guy[1] did something similar and wrote a nice post about it.

[1] http://opiateforthemass.es/articles/james-bond-film-ratings/

I couldn't disagree more. R is great at munging pretty much everything but unstructured textual data. The tools are definitely behind Python if you're dealing with literal written documents.

I don't know anyone who considers themselves a "data scientist" of any sort that doesn't view their job as 80% or more data wrangling/munging/cleaning.

I write production ETL processes in R at my current job. AMA.

May I ask what tools you favor in the R environment? I just haven't found anything as performant for operations on irregular and poorly formatted time series as the pandas library, and in fact I just finished an ETL in pandas for my current job.

I'm always interested in learning a new tool, though.
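For anyone following along, here is a minimal sketch of the kind of pandas cleanup I mean for an irregular, poorly formatted time series. The column names, the sample rows, and the `clean_series` helper are all made up for illustration, and it assumes pandas is installed; it is not the ETL from my job.

```python
import pandas as pd

# Hypothetical raw feed: a duplicate row, an unparseable timestamp,
# a non-numeric value, and rows that arrive out of order.
raw = pd.DataFrame({
    "timestamp": [
        "2015-06-01 09:00:05",
        "2015-06-01 09:00:05",
        "not a date",
        "2015-06-01 09:03:40",
        "2015-06-01 09:01:10",
    ],
    "value": ["1.0", "1.0", "2.0", "n/a", "3.0"],
})


def clean_series(df):
    """Return a regular one-minute series from messy timestamp/value rows."""
    out = df.copy()
    # Anything that does not match the expected format becomes NaT / NaN.
    out["timestamp"] = pd.to_datetime(
        out["timestamp"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
    )
    out["value"] = pd.to_numeric(out["value"], errors="coerce")
    # Drop rows with no usable timestamp, then de-duplicate and sort.
    out = out.dropna(subset=["timestamp"])
    out = out.drop_duplicates(subset="timestamp")
    out = out.set_index("timestamp").sort_index()
    # Put the irregular observations on a regular one-minute grid;
    # minutes with no valid reading stay NaN.
    return out["value"].resample("1min").mean()


result = clean_series(raw)
print(result)
```

Gaps are left as NaN on purpose; whether to interpolate or forward-fill them depends on the analysis.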

I don't work much with data that would benefit from being very tight about datetimes as a dimension. I'd have to know a bit more about what was challenging before I could confidently recommend something for your particular case. My email is on my profile and I'd be happy to chat there if it's something that would be helpful.

I have largely avoided ts, zoo, etc. where possible. Time series stuff seems to have a lot of specialized tooling, all of which tends to be much more strict about data structure than I'm comfortable with for my flow.

I might be misremembering, but I think that it was assumed for a while that Perl would be used for data munging, so there wasn't much effort put into that part of the language. That's being addressed now by packages, and a lot of the uptake in R seems to coincide with work by R developers to make the language less hostile to new users. (Have you used the bundled IDE? Satan's work again.)

BUT Perl + R was a really nice combination for a while.

Depends on coworkers. When I was in school, my professors couldn't do anything if it wasn't in a nice CSV file. But we data scientists / statistical programmers are well versed in digesting data in almost any form, especially from a database. When I start a new project, I just get handed new IP addresses and login information and I am off.

What do you mean by data munging? Things like extracting data from XML files for instance?

It can mean anything, and that is why it is hard. Many old school statisticians cannot work with anything other than CSV files, Excel spreadsheets or basic SQL queries. Munging is the conversion to a nice format that can then be used for analysis.

Understood. Yeah, this sounds like a job for awk or some such specialised tool.
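
To make the XML case above concrete, here is a rough sketch of that kind of munging using only the Python standard library. The XML layout, the field names, and the `xml_to_rows` / `rows_to_csv` helpers are all invented for illustration; real inputs are usually messier than this.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input: one record is missing a value, another has junk in it.
SAMPLE_XML = """<readings>
  <reading id="1"><site>A</site><value>3.5</value></reading>
  <reading id="2"><site>B</site></reading>
  <reading id="3"><site>A</site><value>oops</value></reading>
</readings>"""


def xml_to_rows(xml_text):
    """Flatten each <reading> element into a dict with typed fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("reading"):
        raw_value = node.findtext("value")  # None if the tag is absent
        try:
            value = float(raw_value)
        except (TypeError, ValueError):
            value = None  # missing or non-numeric values become empty cells
        rows.append({
            "id": int(node.get("id")),
            "site": node.findtext("site"),
            "value": value,
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text, the 'nice format' for analysis."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "site", "value"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The hard part in practice is deciding what to do with the bad records (drop, blank out, flag), not the format conversion itself.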