by mistermann 2328 days ago
Can you elaborate on why? I've mostly heard love for them.

My main gripe with them is that the dependencies between the cells are implicit and the results of each cell are resident in memory instead of a more durable form (like artifacts on disk).

The best way to understand why I hate notebooks is to contrast them with my workflow: when I do data science, each step in the process is represented by a Makefile target which lists its products and its dependencies explicitly. All products are represented as concrete objects on disk (json, csv, figures, or serialized representations in some form).

If I need to do reporting, it's typical for various targets to generate fragments of latex and for the report (a pdf) to explicitly document which fragments of latex and which figures belong in the report.

Then a simple `make report.pdf` is enough to generate the final result. If I change something, I can explicitly see which pieces need to be rebuilt and how.
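To make the rebuild rule concrete, here is a minimal Python sketch of the check make performs: a target is rebuilt when it is missing or older than any of its dependencies. The names `is_stale`, `build`, and `summarize` are hypothetical, as are the file names in the usage note; this illustrates the idea and is not a replacement for a real Makefile.

```python
from pathlib import Path
from typing import Callable, Sequence


def is_stale(target: Path, deps: Sequence[Path]) -> bool:
    """Return True if the target is missing or older than any dependency."""
    if not target.exists():
        return True
    built_at = target.stat().st_mtime_ns
    return any(dep.stat().st_mtime_ns > built_at for dep in deps)


def build(target: Path, deps: Sequence[Path],
          recipe: Callable[[Path, Sequence[Path]], None]) -> bool:
    """Run recipe(target, deps) only when needed; report whether it ran."""
    if not is_stale(target, deps):
        return False
    recipe(target, deps)
    return True


def summarize(target: Path, deps: Sequence[Path]) -> None:
    """Hypothetical step: count the data rows in a CSV, write a latex fragment."""
    rows = deps[0].read_text().splitlines()[1:]  # skip the header line
    target.write_text(f"We analysed {len(rows)} samples.\n")
```

Calling `build(Path("summary.tex"), [Path("data.csv")], summarize)` twice in a row runs the recipe only the first time; touching `data.csv` makes the next call rebuild. Every product lives on disk and every dependency is written down, which is the property the workflow above relies on.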

I also believe that the structure of a notebook, which mixes code and reporting, encourages bad software design practices like copy-pasting - it doesn't naturally encourage refactoring of shared code into libraries or anything like that. Most Jupyter notebooks are just a pile of shit, basically. The big problem is that all the cells in a notebook share one big, mutable, global state. This is wrong.
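A small sketch of the shared-state problem, with notebook cells simulated as snippets run against one namespace via `exec`. The cell names and data are made up for illustration; the point is only that the result depends on which cells were run and how many times.

```python
# Each "cell" is a snippet executed against one shared namespace,
# roughly the way a notebook kernel executes cells.
cells = {
    "load": "prices = [10, 20, 30]",
    "scale": "prices = [p * 2 for p in prices]",
    "report": "total = sum(prices)",
}


def run(order):
    """Execute the named cells in the given order and return the reported total."""
    namespace = {}
    for name in order:
        exec(cells[name], namespace)
    return namespace["total"]


top_to_bottom = run(["load", "scale", "report"])          # 120
scale_rerun = run(["load", "scale", "scale", "report"])   # 240


def scale(prices, factor=2):
    """The same step as a pure function: the output depends only on the inputs."""
    return [p * factor for p in prices]
```

Re-running the "scale" cell once by accident silently doubles the total, and nothing in the notebook records that it happened. The function version can be called any number of times with the same result.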

They also don't work well with git, which I view as the absolute crux of any successful technical project.
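One likely reason for the friction: an .ipynb file is JSON, and each code cell stores its `outputs` and `execution_count` next to the source, so re-running a notebook changes the file even when the code is identical. The sketch below assumes that standard layout; `strip_outputs` is a hypothetical helper, not part of Jupyter.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with run-specific fields cleared."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def make_run(count: int, text: str) -> dict:
    """Build a tiny one-cell notebook as it might look after a given run."""
    return {
        "cells": [{
            "cell_type": "code",
            "execution_count": count,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": [text]}],
            "source": ["print(total)"],
        }],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


first_run = make_run(1, "120\n")
second_run = make_run(7, "240\n")
```

Here `first_run != second_run` even though the source is the same, so git would report a change; `strip_outputs(first_run) == strip_outputs(second_run)`, so only the code would show up in a diff.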

This is the big difference between 'data scientist' and 'software engineer'. Your workflow is almost certainly repeatable and verifiable, which is just better science.

The crux is that Jupyter was not made for people with your skills, or more accurately, your pattern of working. What you do is not inherently hard or complicated. Certainly not more than actual data science. It's a cultural thing mostly.

I'm prepared to argue that a data scientist who doesn't apply this level of rigor to their work is almost certainly doing a bad job, though they may deliver results that are good from moment to moment.

The key difference is how traceable and repeatable the process is.