by legerdemain 1880 days ago
LOL, how often do you want your entire notebook to recompute just because you change something somewhere? Have you never tried pursuing a little side experiment in an existing notebook, or have ten abandoned false starts leading to one good result? I have many extremely long notebooks that would almost certainly crash if you tried to recompute the whole thing, and many of the cells won't work at all because the inputs are long gone. Some of these notebooks are years old. The datasets they have in memory aren't saved anywhere else. What possible motivation do I have to lose all of this precious state?

If I wanted a software-grade, rock-solid data pipeline, I would just copy-paste some code from an existing notebook and run it on Papermill.


> Some of these notebooks are years old. The datasets they have in memory aren't saved anywhere else.

That sounds dangerous to me. If your computer crashes or you introduce a bug to your notebook, you could lose all that data. Personally, I prefer my notebooks to be reproducible at any point.

Exactly, or at the very least, pickle/serialise/export/whatever the models so that the computer can survive a reboot.
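For what it's worth, checkpointing in-memory state can be a few lines. This is only a rough sketch using the standard-library `pickle` module; the helper names `save_state` and `load_state`, the file name, and the example `summary` dict are all made up for illustration, and large datasets or models may be better served by a format-specific exporter.

```python
import pickle
import tempfile
from pathlib import Path


def save_state(obj, path):
    """Pickle obj to path, writing to a temporary file first and then renaming,
    so an interrupted write does not clobber an existing checkpoint."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    tmp.replace(path)


def load_state(path):
    """Load a previously pickled object. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical notebook state: a small aggregate kept only in memory.
summary = {"n_rows": 1200, "mean": 3.5}

checkpoint = Path(tempfile.gettempdir()) / "notebook_summary.pkl"
save_state(summary, checkpoint)

# After a kernel restart or reboot, the state can be read back.
restored = load_state(checkpoint)
```

Pickle files are tied to the Python and library versions that wrote them, so this protects against a reboot more than against years of upgrades.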
These are usually small aggregates and summaries, so I just display them in notebook output. It does make it take a bit longer to scroll through the notebook to find something, but that's what being disciplined with organization is for.
Sorry, I'm not sure I'm following your argument. Are you saying your notebooks hold state that's easily reconstitutable, and so it's not actually such a big deal to regenerate your "precious state"?
No worries, apology accepted! You misunderstood what I wrote: parts of my state are small enough for me to print() them in a cell and use the output as reference.
The whole notebook doesn't recompute; only the cells that depend on the cell that changed do. This is extremely powerful because you never end up with stale cells showing incorrect values.
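To make the idea concrete, here is a toy sketch of dependency-driven recomputation. It is not how Pluto is implemented (Pluto infers dependencies from the code itself); in this sketch the class `ReactiveNotebook`, its `define` method, and the explicit `deps` lists are all assumptions made for illustration, and each cell's dependencies are assumed to be defined before the cell itself.

```python
from collections import defaultdict, deque


class ReactiveNotebook:
    """Toy model: each cell is a named function with explicitly listed dependencies."""

    def __init__(self):
        self.cells = {}                      # name -> (func, deps)
        self.values = {}                     # name -> last computed value
        self.dependents = defaultdict(set)   # name -> cells that read it
        self.run_log = []                    # order in which cells were executed

    def define(self, name, func, deps=()):
        """Define or redefine a cell, then rerun it and everything downstream."""
        if name in self.cells:
            for dep in self.cells[name][1]:
                self.dependents[dep].discard(name)
        self.cells[name] = (func, tuple(deps))
        for dep in deps:
            self.dependents[dep].add(name)
        self._recompute_from(name)

    def _affected(self, name):
        """Collect the changed cell plus all cells that transitively depend on it."""
        seen = {name}
        queue = deque([name])
        while queue:
            current = queue.popleft()
            for child in self.dependents[current]:
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen

    def _recompute_from(self, name):
        """Run affected cells in dependency order; unaffected cells are left alone."""
        pending = self._affected(name)
        while pending:
            ready = sorted(c for c in pending
                           if not (set(self.cells[c][1]) & pending))
            if not ready:
                raise ValueError("cyclic dependency between cells")
            for cell in ready:
                func, deps = self.cells[cell]
                self.values[cell] = func(*(self.values[d] for d in deps))
                self.run_log.append(cell)
                pending.discard(cell)


nb = ReactiveNotebook()
nb.define("x", lambda: 2)
nb.define("y", lambda x: x + 1, deps=["x"])
nb.define("z", lambda: 100)                       # unrelated to x
nb.define("total", lambda x, y: x * y, deps=["x", "y"])

nb.run_log.clear()
nb.define("x", lambda: 10)                        # edit one cell
# Only x and its dependents ran again; z kept its old value.
```

After the edit, `nb.run_log` lists `x`, `y`, and `total`, while `z` is never re-executed.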
This is extremely counterproductive, because I want results you're calling "stale" to use as a reference or inspiration. I don't want to destroy old results just because I changed some parameter value to test an idea.
It improves reproducibility, consistency, and sharing, but reduces convenience for some operations. It's a trade-off in favor of programming in the large.

If you don't want to recompute dependent nodes, then use new names for your experiments rather than redefining old functions and variables. Yes, in some ways this is less convenient for you, but it's more convenient for people receiving your notebooks, because the notebook is always in a consistent, reproducible state.

Maybe it doesn't work well for your workflow, particularly if you're not sharing notebooks and you keep your notebooks small. On the other hand, if your workflow involves leaving notebooks in an inconsistent state much of the time, switching may save you significant frustration with larger notebooks, and save you from losing work because you lost track of which cells were inconsistent.

Also, if you hit a state that you really don't want to lose, you should probably do a quick git commit. You can always squash commits later if needed.

It might be worth changing your workflow, or it might not.

I think this is the interesting point though. Many people want to use Jupyter notebooks so that their work looks reproducible, not to make it actually reproducible. God forbid it actually has to be re-run, it could have different results!

I think that's my main notebook gripe: they make it look like if you run the code you'll get these results, but that's not even close to the case. Many people abuse this. At this point, I pretty much assume anything in a Jupyter notebook isn't reproducible.

Yes. A Jupyter notebook is only reproducible, in my opinion, if you can hit "Restart Kernel and execute all cells" and get the same result.

Otherwise, it should never be shared with other people, or even contain relevant analysis you may need for yourself later.

But this is not enough - the library dependencies also need to be fixed. Pluto will make this very easy in the near future: https://github.com/fonsp/Pluto.jl/pull/844

If you're so attached to that data, you should probably do something to save it other than letting it sit in RAM or maybe in an old plot in a random notebook.
I'm not sure what exactly you're trying to do: harangue me into using this half-baked notebook replacement, or just tell me how to do my job.
Then instead of reassigning new data to foo, just assign the new data to foo2. You can still use the notebook to experiment; all you are doing is removing ambiguity.
> how often do you want your entire notebook to recompute just because you change something somewhere?

This is exactly what I want, always. In Jupyter I'm continuously doing the restart-kernel-and-re-run-all-cells dance. It is annoying, and I'd love another system optimized for that, like Pluto, without those stupid non-deterministic cells.