by carljv 2968 days ago
The Jupyter team deserves every accolade they get and more. The console, notebook, and now JupyterLab are some of the key reasons why Python's data ecosystem thrives.

I think Jupyter notebooks are quite useful as "rich display" shells. I often use them to set up simple interactive demos or tutorials to show folks or keep notes or scratch for myself.

That being said, I do think the "reproducibility" aspect of the notebook is overblown for the reasons other comments cite. Notebooks are hard to version control and diff, and are easy to "corrupt." I often see Jupyter notebooks described as "literate programs," and I really don't think that's an apt description. The notebook is basically the IPython shell exposed to the browser where you can display rich output.
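A minimal sketch of why the diffs get noisy, assuming the nbformat 4 JSON layout for a single code cell (the cell contents are made up for illustration): the outputs and execution counter are stored in the same file as the source, so running a cell changes the file even when the code does not.

```python
import json

def code_cell(source, execution_count=None, outputs=None):
    """Build a dict shaped like an nbformat 4 code cell."""
    return {
        "cell_type": "code",
        "execution_count": execution_count,
        "metadata": {},
        "outputs": outputs or [],
        "source": source,
    }

def serialize(cell):
    # Notebooks are stored as JSON, so this text is what version control sees.
    return json.dumps(cell, indent=1, sort_keys=True)

source = "print(sum(range(10)))"

# The same cell before and after it has been executed once.
before = code_cell(source)
after = code_cell(
    source,
    execution_count=1,
    outputs=[{"output_type": "stream", "name": "stdout", "text": ["45\n"]}],
)

source_unchanged = before["source"] == after["source"]
file_changed = serialize(before) != serialize(after)
print(source_unchanged, file_changed)
```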

This is where I think the R ecosystem's approach to the problem is better (a bit like org-mode & org-babel). For them, there is a literate program in plain text. Code blocks can be executed interactively and results displayed inline by a "viewer" on the document (like that provided by RStudio), but executing code doesn't change the source code of the program, and diffs/versions are only created by editing the source. At any point, the file can be "compiled" or processed into a static output document like HTML or PDF.
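To make the separation concrete, here is a rough Python sketch of a "viewer" over a plain-text literate document. The chunk delimiters are hypothetical and loosely modelled on noweb, not R Markdown's actual syntax, and `extract_chunks` and `run_chunks` are illustrative names. The point is only that results are collected outside the source text, which is never rewritten.

```python
import contextlib
import io

# Hypothetical chunk delimiters, loosely modelled on noweb; not R Markdown syntax.
CHUNK_OPEN = "<<code>>="
CHUNK_CLOSE = "@"

DOCUMENT = """\
The sum of the first ten integers:
<<code>>=
total = sum(range(10))
print(total)
@
And its square:
<<code>>=
print(total ** 2)
@
"""

def extract_chunks(text):
    """Return the code chunks found in a plain-text document."""
    chunks, current = [], None
    for line in text.splitlines():
        if line == CHUNK_OPEN:
            current = []
        elif line == CHUNK_CLOSE and current is not None:
            chunks.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(line)
    return chunks

def run_chunks(text):
    """Execute each chunk in a shared namespace and collect its printed output."""
    namespace, results = {}, []
    for chunk in extract_chunks(text):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(chunk, namespace)
        results.append(buffer.getvalue())
    return results

original = DOCUMENT
results = run_chunks(DOCUMENT)
print(results)
```

The results live in a separate list that a viewer could render inline, while the document on disk stays byte-for-byte the same until someone edits it.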

This is essentially literate programming but with an intermediate "interactive" feature facilitated by an external program. RMarkdown source doesn't know it's being interacted with or executed, and you can edit it like any other literate program.

Interaction, reproducibility, and publication have fundamental tensions with each other. Jupyter notebooks are trying to do all three in the same software/format, and my sense is that they're starting to strain against those tensions.


Notebooks can be reproducible; they just aren't automatically so. It requires a little bit of effort and discipline, if reproducibility is a goal. https://www.svds.com/jupyter-notebook-best-practices-for-dat... is an excellent starting point. Personally, I use notebooks to keep a record of large computational pipelines. The key is to cache all results to disk. This allows for an iterative process where I modify the notebook, kill the kernel, and rerun everything. Only new calculations will be executed; everything previously calculated will simply be loaded from disk. In the end, I have a reproducible record of the entire project (and rerunning the notebook is fast). This kind of make-like functionality is implemented through the doit Python package (http://pydoit.org). An example workflow for this is http://clusterjob.readthedocs.io/en/latest/pydoit_pipeline.h...
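The cache-to-disk idea can be sketched with only the standard library. This is a simplified stand-in, not the doit API: the `cached` helper, the temporary cache directory, and `expensive_step` are all hypothetical names used for illustration.

```python
import pickle
import tempfile
from pathlib import Path

def cached(path, compute):
    """Load a result from `path` if it exists; otherwise compute and store it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(result, handle)
    return result

calls = []

def expensive_step():
    # Stand-in for a long-running calculation; records each real execution.
    calls.append(1)
    return [n * n for n in range(5)]

# A real project would use a fixed directory next to the notebook.
cache_dir = Path(tempfile.mkdtemp())

first = cached(cache_dir / "squares.pkl", expensive_step)
# Second call mimics rerunning the notebook after killing the kernel.
second = cached(cache_dir / "squares.pkl", expensive_step)
print(first == second, len(calls))
```

Unlike make or doit, this sketch does no dependency tracking: if the calculation changes, the stale file has to be deleted by hand.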
So basically you write a script instead of a notebook? If you save the data to disk, is it still displayed with Jupyter's rich formatting?
Well, it also contains markdown comments (often with LaTeX formulas, so the graphical rendering is appreciated), and, most importantly, plots of the results (which are typically fast to generate, so they are not cached).
I agree, 120%.

I like the R approach so much more.

I mean, as a medium for interactive exploration where you might want graphs and widgets or other rich/dynamic output, I still think the notebook is superior. But as a medium for developing complete, share-able, reproducible data analyses, I do think R has the upper hand.
Graphs, widgets, and other rich/dynamic output are also possible with the R approach.

https://rmarkdown.rstudio.com/

Additionally, RStudio is an incredibly powerful IDE for data analysis.

EDIT: Interestingly, however, I still use ESS https://ess.r-project.org/ but that's because I love Emacs too much :D

I understand. I believe I pointed all that out in my comment above. I wasn't saying that I find notebooks superior because they allow for rich & dynamic output, but that I find them superior to RStudio when all you want is a quick exploratory REPL capable of rich/dynamic output. I simply find it easier to fire up a notebook and start noodling around than to write an RMarkdown notebook. That really only holds if I'm not overly concerned with keeping or sharing the notebook. Otherwise, I believe RMarkdown is the better option.

I also tend to gravitate towards ESS, and probably split my R development time between Emacs and RStudio. I've even written a very kludgy Rmd notebook mode that uses overlays to show evaluation results from code chunks. But RStudio is very well-designed and ESS just doesn't compare feature-wise, sadly.

I just like Python's pandas better than R.
Not me. I'd take dplyr and related libraries over pandas any day. I've been using pandas for 6 years and I'm still regularly tripped up by parts of its API.