by Gatsky 4223 days ago
How do you feel about reproducible computing in python? R is very well set up to A) get it running on any platform easily B) report the crucial parts of the environment. I know that if I grab someone else's (published) code written in R, I'm pretty confident I can make it work. Part of this is the great package management through CRAN or Bioconductor, and also because often important reference data for bioinformatics is actually available through the package manager.
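For comparison, here is a minimal Python sketch of what "reporting the crucial parts of the environment" could look like, using only the standard library (Python 3.8+ for `importlib.metadata`). The function name `environment_report` and the package names passed to it are hypothetical, chosen for illustration; this is not an established convention.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Return a dict describing the interpreter, OS, and selected package versions.

    `environment_report` is a hypothetical helper name, not a standard API.
    """
    report = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": {},
    }
    for name in packages:
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            report["packages"][name] = None
    return report


if __name__ == "__main__":
    # The package names here are only examples.
    for key, value in environment_report(["pip", "pandas"]).items():
        print(f"{key}: {value}")
```

Saving that output next to the results of an analysis gives a reader at least the interpreter version and the versions of the packages that matter.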

I haven't done much with Python, but I don't quite get the same feeling (happy to be told that the reality is otherwise!). For example, the opening line of the installation guide for Pandas doesn't inspire great confidence in me: "The easiest way for the majority of users to install pandas is to install it as part of the Anaconda distribution, a cross platform distribution for data analysis and scientific computing."[1] Do I really need to install the HDF5 package so I can split a concatenated variable into two columns??

[1] http://pandas.pydata.org/pandas-docs/stable/install.html

4 comments

The thing w/ reproducible research (I was an early BioC core member and have worked directly w/ its RR advocates) is that it requires having an exact set of R and packages. I know that BioC tries to do this (I wrote the original BioC package download script) but weird things can still happen. A few years ago I was tracking down a bug in some computational biologist's code that turned out to come from some wacky version of a particular package which might be downloaded in the right circumstances.

In a previous life, what we did for every project was download a snapshot of an R environment, including all packages. That, and only that, was used for all computation involving that project from start to finish. If Docker had been around at the time, that's what we'd have used.

Thanks for your work with BioC, it's fantastic. I use it a lot in my cancer genomics research. Part of that involves providing a service to patients living with cancer, so your work is definitely out there having an impact!
Thanks but I haven't been a contributor for a decade, I just had a hand in the early days. I agree though that it's a phenomenal suite for the bioinformatics world and an exemplar of proper R techniques
Python's pip is pretty good, though not quite as polished as CRAN. I have had few problems running complex code from third-party sources, though one always has to be aware of the Python 2 v 3 "problem" (though it is diminishing now, with most things available on 3). If you get pip up and running on a new Python installation you can avoid Anaconda/Canopy if you want a clean installation, and I have installed fairly complex Python setups in multiple locations without too much trouble. To be fair, R can also be tough if it calls a lot of third-party libraries. Just try to get rJava working properly, for example, if the local R and Java installations are not both 32-bit or both 64-bit. It can be a complete mess to disentangle this sort of stuff in R. Or, for example, running code that uses Cairo on a Mac. My experience is that Python's poor package management reputation is not really deserved anymore. Python's virtualenv also allows you to hermetically seal away an entire Python environment, including its libraries, so that it will not conflict with other Python environments that might have different versions of the interpreter and/or libraries. I am not aware of anything this robust in R.
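To make the isolation idea concrete, here is a minimal sketch using the standard library's `venv` module (Python 3.3+). It is a close relative of virtualenv, not virtualenv itself. Unlike virtualenv, `venv` always reuses the interpreter that runs it, so it isolates libraries but does not select a different Python version. The helper name `make_project_env` and the `.venv` directory name are assumptions for illustration.

```python
import tempfile
import venv
from pathlib import Path


def make_project_env(project_dir):
    """Create an isolated environment under project_dir/.venv and return its path.

    `make_project_env` and the `.venv` directory name are illustrative choices.
    """
    env_dir = Path(project_dir) / ".venv"
    # with_pip=False keeps this sketch fast and offline; pass True to bootstrap pip.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = make_project_env(tmp)
        # pyvenv.cfg records which base interpreter the environment was built from.
        print((env_dir / "pyvenv.cfg").read_text())
```

Packages installed into such an environment live inside its own directory, so two projects can depend on different versions of the same library without conflict.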

Reproducible computing? The ipython notebook is awesome, though I am not sure if there is anything as good as knitr if your workflow is LaTeX oriented.

R "hands" will usually find Python a backward step when it comes to vectorized data manipulation, but it's a forward leap if your data becomes too big or if you have to step out of the comfy environment of exploratory analysis into any form of (even trivial) production setting.

And no you definitely do not need HDF5 to effectively use Pandas.

The closest equivalent to virtualenv for R is packrat: http://rstudio.github.io/packrat/. It doesn't (yet) support different R versions for different projects, but that's on the roadmap.
Yeah packrat is great! It is a really important package which has greatly increased my willingness to use R in production.
Ok that's good to know. Sure, R breaks inexplicably sometimes due to dependencies, no doubt about that.

virtualenv sounds useful. Is it used much when python code is published in a paper?

About HDF5: I was just making the point that the Pandas docs recommend I install Anaconda to get Pandas, thus also installing HDF5. I am sure there are other ways, but the way the documentation is phrased suggests that these other ways are overly difficult.

I'm just learning Python to do some data and graph analysis experiments. Should I go with Python 2 or 3?
You are strongly encouraged by the Python powers that be to move to 3, and I have only in the past few months begun to agree with them, because some serious standard libraries like asyncio are now only available on 3. It's (finally) the future. However, a big caveat is that if you're learning Python, most of the sample code you will find on the web will be 2-based and will not work well under 3. It's not so much the print statement, but range() works subtly differently now too (it returns a lazy range object rather than a list, which is too subtle for beginners to properly understand, in my view) and unicode strings can break older code too. Just be aware of these things and move to 3 is my (51/49) advice, but this is a controversial point and others will have differing points of view.
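A short sketch of the two differences mentioned above, as they behave under Python 3. The variable names are arbitrary; under Python 2 several of these lines would give different results.

```python
# Python 3 behaviour; Python 2 differs on several of these lines.

r = range(5)
print(type(r).__name__)  # range, not list
print(list(r))           # materialise explicitly when a real list is needed
print(r[2], len(r))      # a range still supports indexing and len()

# Text (str) and binary data (bytes) are distinct types in Python 3.
text = "caf\u00e9"            # four characters, the last one non-ASCII
data = text.encode("utf-8")   # str -> bytes must be done explicitly
print(type(text).__name__, type(data).__name__)
print(len(text), len(data))   # 4 characters, 5 bytes
print(data.decode("utf-8") == text)
```

Old sample code that treats the result of range() as a list (for example, calling `.append()` on it), or that mixes byte strings and text freely, is the kind that tends to fail on 3.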
I find knitr easy to use. The way it generates graphs and can output to PDF/HTML is really useful, reproducible, and easily shared. While it is essentially just markdown + R code, the code can point to data sets instead of having them embedded. It has a good set of graphing libraries (ggplot2, etc.) too. I can see how this could be the killer app that gets social science research papers written and produced in knitr. I always thought IPython would take this crown, but R/knitr is looking good. Have not used Shiny yet.

Edit: knitr, not rdoc

You don't have to install the entirety of anaconda. You can install miniconda (from here: http://conda.pydata.org/miniconda.html) and then do `conda install $package_name` or, if like me you like to create separate environments for separate projects... `conda create --name $environment_name python; source activate $environment_name; conda install $package_name`

disclosure: I work on miniconda. I'm currently working on improving our developer experience. Complaints are welcome.