by leeber 4223 days ago
He forgot the last one: "to use python instead"

That was my attempt at a joke. But seriously, I love python when it comes to manipulating data and doing anything statistical. You've got numpy, scipy, scikit-learn, etc.


As a student of statistics, I'm kind of split on R. On one hand, it's just not a very well-designed language. The fact that it has three (!) independent object systems is a testament to this. On the other hand, as vegabook also mentions, working with vectors and matrices is just a lot more natural in R than in general-purpose languages like Python, because R's syntax has been built from the ground up to work with the kind of structures you usually work with in science.

I'm hoping Julia might become a good alternative to R and Python, but I can't see it catching on in the statistical community anytime soon given how many people are still using relics like SAS and Stata. The raw fact is that statisticians (considered as a group) just aren't very good at programming (and many older statisticians can't program at all), which means that a well-designed programming language may not necessarily be easy to use for a member of the statistical community used to point-and-click statistics suites.

I think R is a well-designed language. It definitely has its quirks (what language doesn't?), but by-and-large they are problems with the standard library, not the language. This is admittedly a subtle distinction, but it's much easier to fix problems with the standard library than it is with the language.

Three aspects of the language that make R particularly well suited for statistical programming are:

1) Missing values built in at a fundamental level.

2) Metaprogramming capabilities. The best way to solve many categories of data analysis problems is to design a domain specific language which allows you to easily combine independent pieces. R's incredible flexibility is great for this.

3) Fundamentally vectorised and functional. This allows you to elegantly express many common data analysis tasks.
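For readers coming from Python, point 1 can be illustrated with a rough sketch. In R, `mean(c(1, NA, 3))` is `NA` unless `na.rm = TRUE` is passed. The Python below only imitates that behaviour: `NA`, `mean` and `skip_na` are names invented for this illustration, and `math.nan` is an assumed stand-in because Python has no built-in missing-value type.

```python
import math

NA = math.nan  # assumed stand-in for R's NA; Python has no native equivalent


def mean(values, skip_na=False):
    """Arithmetic mean; NaN propagates unless skip_na is set (like R's na.rm)."""
    if skip_na:
        values = [v for v in values if not math.isnan(v)]
    if not values:
        return math.nan
    return sum(values) / len(values)


readings = [1.0, NA, 3.0]
strict_mean = mean(readings)                  # NaN: the missing value propagates
lenient_mean = mean(readings, skip_na=True)   # mean of the two observed values
print(strict_mean, lenient_mean)
```

The difference is that in R this handling is part of the language and of nearly every function, whereas here it has to be written (or imported) explicitly.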

Could you describe what facilities of R help with metaprogramming and make it good for designing DSLs?
http://adv-r.had.co.nz/dsl.html and the two prior chapters
How do you feel about reproducible computing in python? R is very well set up to A) get it running on any platform easily B) report the crucial parts of the environment. I know that if I grab someone else's (published) code written in R, I'm pretty confident I can make it work. Part of this is the great package management through CRAN or Bioconductor, and also because often important reference data for bioinformatics is actually available through the package manager.

I haven't done much with Python, but I don't quite get the same feeling (happy to be told that the reality is otherwise!). For example, the opening line of the installation guide for Pandas doesn't inspire great confidence in me: "The easiest way for the majority of users to install pandas is to install it as part of the Anaconda distribution, a cross platform distribution for data analysis and scientific computing."[1] Do I really need to install the HDF5 package so I can split a concatenated variable into two columns??

[1] http://pandas.pydata.org/pandas-docs/stable/install.html

The thing w/ reproducible research (I was an early BioC core member and have worked directly w/ its RR advocates) is that it requires having an exact set of R and packages. I know that BioC tries to do this (I wrote the original BioC package download script) but weird things can still happen. A few years ago I was tracing down a bug in some computational biologist's code that really traced down to some wacky version of a particular package which might be downloaded in the right circumstances.

In a previous life what we did was for every project you'd download a snapshot of an R environment, including all packages. That, and only that, was used for all computation for everything involving that project from start to finish. If Docker was around at the time, that's what we'd have used.
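A lighter-weight cousin of the full snapshot approach is to at least record what was installed when the analysis ran. The following is a minimal Python sketch of that idea, not the workflow described above: `snapshot_environment` is a hypothetical helper, and it only reports versions (using the standard library's `importlib.metadata`, available from Python 3.8), it does not pin or restore them.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Collect interpreter, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            continue  # skip distributions whose metadata cannot be read
        if name:
            packages[name] = version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


snapshot = snapshot_environment()
print(json.dumps(snapshot, indent=2))
```

Saving this JSON next to the results gives a later reader something concrete to compare against when a run cannot be reproduced.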

Thanks for your work with BioC, it's fantastic. I use it a lot in my cancer genomics research. Part of that involves providing a service to patients living with cancer, so your work is definitely out there having an impact!
Thanks but I haven't been a contributor for a decade, I just had a hand in the early days. I agree though that it's a phenomenal suite for the bioinformatics world and an exemplar of proper R techniques
Python's pip is pretty good, though not quite as polished as CRAN. I have had few problems running complex code from third-party sources, though one always has to be aware of the Python 2 v 3 "problem" (it is diminishing now, with most things available on 3). If you get pip up and running on a new Python installation you can avoid Anaconda/Canopy if you want a clean installation, and I have installed fairly complex Python setups in multiple locations without too much trouble.

Let's be fair, R can also be tough if it calls a lot of third-party libraries. Just try to get rJava working properly, for example, if the local R and Java installations are not both 32-bit or both 64-bit. It can be a complete mess to disentangle this sort of stuff in R. Or try running code that uses Cairo on a Mac. My experience is that Python's poor package management reputation is not really deserved anymore.

Python's virtualenv also lets you hermetically seal away an entire Python environment, including its libraries, so that it will not conflict with other Python environments that might have different versions of the interpreter and/or libraries. I am not aware of anything this robust in R.
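To make the isolated-environment idea concrete, here is a small sketch. Note the swap: it uses the standard library's `venv` module (Python 3.3+) rather than the third-party virtualenv tool named above, since the two serve the same purpose and `venv` needs no installation. `create_isolated_env` and the `analysis-env` directory name are made up for this example.

```python
import tempfile
import venv
from pathlib import Path


def create_isolated_env(parent_dir, name="analysis-env"):
    """Create a bare virtual environment and return its path."""
    env_dir = Path(parent_dir) / name
    # with_pip=False keeps the sketch fast and offline; pass True for real use
    venv.create(env_dir, with_pip=False)
    return env_dir


# A temporary directory is used so the example cleans up after itself.
with tempfile.TemporaryDirectory() as scratch:
    env_dir = create_isolated_env(scratch)
    config_path = env_dir / "pyvenv.cfg"
    has_config = config_path.is_file()
    config_text = config_path.read_text()

print(has_config)
print(config_text)
```

The `pyvenv.cfg` file records which base interpreter the environment was built from; packages installed inside the environment stay in its own directory tree.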

Reproducible computing? The ipython notebook is awesome, though I am not sure if there is anything as good as knitr if your workflow is LaTeX oriented.

R "hands" will usually find Python a backward step when it comes to vectorized data manipulation, but it's a forward leap if your data becomes too big or if you have to step out of the comfy environment of exploratory analysis into any form of (even trivial) production setting.

And no you definitely do not need HDF5 to effectively use Pandas.

The closest equivalent to virtualenv for R is packrat: http://rstudio.github.io/packrat/. It doesn't (yet) support different R versions for different projects, but that's on the roadmap.
Yeah packrat is great! It is a really important package which has greatly increased my willingness to use R in production.
Ok that's good to know. Sure, R breaks inexplicably sometimes due to dependencies, no doubt about that.

virtualenv sounds useful. Is it used much when python code is published in a paper?

About HDF5: I was just making the point that the Pandas docs recommend I install Anaconda to get Pandas, thus also installing HDF5. I am sure there are other ways, but the way the documentation is phrased suggests that these other ways are overly difficult.

I'm just learning Python to do some data and graph analysis experiments. Should I go with Python 2 or 3?
You are strongly encouraged by the Python powers that be to move to 3, and I have only in the past few months begun to agree with them, because some serious standard libraries like asyncio are now only available on 3. It's (finally) the future. However, a big caveat is that if you're learning Python, most of the sample code you will find on the web will be 2-based and will not work well under 3. It's not so much the print statement, but range() works subtly differently now too (it returns a lazy range object rather than a list, which is too subtle for beginners to properly understand in my view), and unicode strings can break older code too. Just be aware of these things and move to 3 is my (51/49) advice, but this is a controversial point and others will have differing points of view.
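The differences mentioned here are easy to see side by side. This is a short sketch that runs on Python 3, with the Python 2 behaviour noted in comments. Strictly speaking, Python 3's `range()` gives back a lazy sequence object rather than a generator; the variable names are only illustrative.

```python
# Python 3 behaviour; the Python 2 results are noted in comments.

numbers = range(5)
# Python 3: a lazy range object. Python 2: the list [0, 1, 2, 3, 4].
is_list = isinstance(numbers, list)
as_list = list(numbers)  # explicit conversion when a real list is needed

# Unlike a one-shot generator, a range can be iterated more than once.
first_pass = sum(numbers)
second_pass = sum(numbers)

# Text and bytes are distinct types on Python 3.
text = "naïve"
encoded = text.encode("utf-8")  # bytes; the "ï" takes two bytes in UTF-8
lengths = (len(text), len(encoded))

# print is a function on Python 3, so the parentheses are required.
print(is_list, as_list, first_pass, second_pass, lengths)
```

Old tutorial code that indexes into, appends to, or concatenates the result of `range()` as if it were a list is the kind that tends to need a `list(...)` wrapper on Python 3.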
I find knitr easy to use. The way it generates graphs and can output to pdf/html is really useful, reproducible and easily shared. While it is essentially just markdown + R code, the code can point to data sets instead of having them embedded. It has a good set of graphing libraries (ggplot2, etc) too. I can see how this could be the killer app that gets social science research papers written and produced in knitr. I always thought IPython would take this crown but R/knitr is looking good. Have not used Shiny yet.

Edit:knitr not rdoc

You don't have to install the entirety of anaconda. You can install miniconda (from here: http://conda.pydata.org/miniconda.html) and then do `conda install $package_name` or, if like me you like to create separate environments for separate projects... `conda create --name $environment_name python; source activate $environment_name; conda install $package_name`

disclosure: I work on miniconda. I'm currently working on improving our developer experience. Complaints are welcome.

yes I have moved (back) to Python mainly because R is too slow when we get beyond a certain data size and the language is not powerful enough when data starts having to be moved around at scale. I have a 5-10 times speed improvement in native Python and another 30x more if I can vectorize things in Numpy. However a huge caveat is that R is much more succinct when it comes to exploratory analysis during what I call the "data rotation" phase because its vectorized nature is so much more efficient at selecting, reducing, cleaning and rotating data, than even Pandas can manage. It's irritating having to write list comprehensions constantly for what would often have been a ridiculously direct and efficient vectorized command in R. Moreover R's graphics leave matplotlib in the dust, though this advantage is eroding with the JS libraries taking over.
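As a rough illustration of the list-comprehension versus vectorized styles being compared, the sketch below filters out missing and small values and takes square roots both ways. The data are made up, and NumPy is assumed to be installed. In R the same selection would be roughly `sqrt(x[!is.na(x) & x > 2])`.

```python
import numpy as np

# Hypothetical measurements; nan marks a missing value.
values = [4.0, float("nan"), 9.0, 16.0, 1.0]

# Pure Python: drop missing and small values, then transform each element.
# (v == v is False only for nan.)
with_lists = [v ** 0.5 for v in values if v == v and v > 2.0]

# NumPy: the same selection and transform as whole-array operations.
arr = np.array(values)
mask = ~np.isnan(arr) & (arr > 2.0)
with_numpy = np.sqrt(arr[mask])

print(with_lists)
print(with_numpy)
```

Both versions keep the same three values; the NumPy form pushes the loop into compiled code, which is where the speedups on large arrays come from.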

The other area where Python crushes R is if your data is live streaming. Here you inevitably need a full fledged programming language with proper asynchronous io capabilities and multithreading / multiprocessing that is not batch oriented.
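A minimal sketch of what the streaming case can look like with the standard library's asyncio (Python 3.7+ for `asyncio.run`). The feed is simulated: `tick_source` and `running_means` are hypothetical names, and `asyncio.sleep` stands in for waiting on a real network source.

```python
import asyncio


async def tick_source(prices, delay=0.0):
    """Hypothetical feed: yields one price at a time, as a live stream would."""
    for price in prices:
        await asyncio.sleep(delay)  # stands in for waiting on the network
        yield price


async def running_means(stream):
    """Consume the stream and record the running mean after each tick."""
    total, count, means = 0.0, 0, []
    async for price in stream:
        total += price
        count += 1
        means.append(total / count)
    return means


means = asyncio.run(running_means(tick_source([10.0, 12.0, 14.0])))
print(means)
```

The statistic is updated as each value arrives rather than after a batch has been collected, and while the consumer is waiting on the feed the event loop is free to service other tasks.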

Can you give an example of a somewhat complicated vectorized command in R that would require lots of list comps in Python?
Totally agreed. I do model analysis on data sets with 200k-5m rows and anywhere from 500 to 20k columns. I originally started doing my work in R, but about two years ago, python started improving rapidly for heavy data analysis, and at the moment I'd say it's a clear winner.
For that kind of data or larger, I would avoid R and Python and move to writing my own algorithms or try out something for more heavy duty analysis such as Mahout or Spark. R and Python are still one box and memory constrained.
I know we don't reward snarky humor 'round these parts, but I was about to say the same thing. Python seems to own this space and the ecosystem around Python and math/stats/analysis is exceptionally healthy. If there's a specific place where R kicks ass please speak up -- it's fallen off my radar.
There are three areas where I think R is the clear winner:

1) An IDE for data analysis/programming: RStudio

2) Easy way to turn your analyses into reports: knitr

3) Easy way to turn your analyses into interactive webapps: shiny

(I also think R wins on visualisation and data manipulation, but I'm biased ;)

R absolutely wins on visualization and data manipulation. I'll spare you the immodesty :-)
I use both Python and R a fair bit. As a language, absolutely I prefer Python to R. However, I think there are two areas where R is better than Python and together, I think they add up to a durable advantage, at least for stats people. 1) Package support. Yes, Pandas and scikit-learn are good, but R still has a definite edge here. Here are three things I've needed lately where R has hands-down better code available: forecasting, frequent itemset mining, and network community detection. 2) Non-programming uses. There are a lot of tasks where you need a computer, but just to do one thing, a plot, calculate a statistic, ... stuff like that. R is better in that use case.
R is in some ways more forgiving to newcomers. Sure, there's all sorts of weirdness around how vectors and matrices work, and don't get me started on the cryptic function naming, but (1) almost all batteries are included -- hardly ever a need to hunt around for packages, (2) RStudio is really nice, with graphics, a shell, a text editor, documentation etc. all in one place, (3) it's mature and well-tested.

I prefer Python myself, but after spending a couple of months with R I do understand why people like it.

(OTOH I'll be a happy person if I never ever have to work with SAS ever again.)

> R is in some ways more forgiving to newcomers.

Oops! Sorry, sorry... really sorry, apologies for snorting coffee over you, but given multiple years of experience TA'ing for machine learning / data mining courses I couldn't disagree more. R had them in absolute knots, and yes, they were asked to use RStudio if that helped. They struggled with simple things such as writing a naive Bayes classifier. Most of their mistakes were because of R's weird and silent inconsistencies: scalar or vector, copy or reference.

It is possible that all these 30 odd students every year were stupid but chances are fairly low.

EDIT:

The course has since switched to Java (Knime) and Python and that has gone a whole lot smoother.

Neither Java nor Python is my favorite language, but I have to concede that Python is massively more consistent than R, so a student has fewer special cases to remember, and the whipping boy of a dearth of packages seemed less real, at least in the context of the course. At least in the academic setting, Enthought Canopy / Anaconda does a marvelous job of it.

I said more forgiving. It's certainly not a forgiving language or ecosystem in absolute terms, you're right on the mark there. But ultimately you have to pick your poison. Do you want to struggle with all of the various quirks of R or do you want to struggle with all of the various quirks of (data analysis in) Python?