| So here's the thing I struggle with. I do a lot of work in Jupyter notebooks. I come up with a new model or approach to some problem, and I want to fork out and test a hypothesis in the background (which might be some set of hyperparameters, and might take several minutes, or hours; call it Run A) while continuing to work down some other path in the same notebook, and maybe kick off a Run B that explores some other change (like a restructure of the code that's not "compatible" with the hyperparameter search of Run A). Then at some point when Run A finishes, I want to incorporate the changes I made in Run B and kick off Run C, and so on. The hard/important things are: 1) Being able to do this while staying in a Jupyter notebook context the whole time. Even something as simple as multiprocessing sucks because I've found it's too hard to manage in a Jupyter context (e.g. how do you handle where stdout and stderr go?). It's easier if you move to scripts where you have full support for this sort of thing and you are expecting to look at multiple log files on disk and whatnot. Also the sequential nature of notebooks doesn't help when you want to occasionally fork out or conditionally run stuff. 2) Keeping track of all these changes and hypotheses and merging the results/code together as you learn. It's like you need a VCS for your hypotheses. Maybe Hydra & wandb help with that; I haven't used them. But this idea of keeping track of hypotheses seems like the more fundamental thing. 3) The main reason I prefer to stay in a notebook context is that I have all my objects easily accessible. My models, all my dataframes, functions to do some ad-hoc charting etc, all super easy to access in a REPL-like form. That is invaluable for doing ad-hoc sanity checks or digging/drilling down.
So a big part of the workflow is you basically have this in-memory database of a bunch of relevant objects and you're querying it and constructing new objects & visualisations using Python as your tool, without having to load things from disk or build up the context from scratch. It's all "just there". 4) And then sometimes you want to take the results X1 of that notebook and plot them against some entirely different set of data X2 that requires a whole bunch of other code that you've defined in some other notebook somewhere, or maybe even as a real Python module. Like maybe that data lives in a database and you transform it or something. So OK, you call some functions to load X2 within your original notebook, but BOOM you get an OOM and you're like, OK, now I have to write some code to serialise X1 to disk, and make YET ANOTHER notebook so I can go analyze X1 and X2. It all just seems so... unnecessary, if only the right tooling existed. My current best approach is to use semantic versioning on the filename, just copy the whole notebook each time I make a fundamental change, and try to keep track of my hypotheses, preconditions, learnings etc within comments, and keep a few of those running at once, but it's often hard to engage in critical thinking when everything you know is sprawled across multiple notebooks. Maybe a simple global journal is the only thing for this sort of use case. And that doesn't even address (4) which is often a huge pain point. Can anyone think of something better? |
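On point (1), one possible way to keep a background run manageable from a notebook is to start it as a separate Python process and send its stdout and stderr to per-run log files. This is only a rough sketch using the standard library; the helper `launch_run`, the run name `run_a`, and the toy `RUN_A` command are all hypothetical stand-ins, not anything from the question.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def launch_run(name, code, log_dir):
    """Start `code` in a separate Python process with per-run log files.

    Hypothetical helper: returns the Popen handle plus the two log paths.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    out_path = log_dir / f"{name}.out"
    err_path = log_dir / f"{name}.err"
    with open(out_path, "w") as out, open(err_path, "w") as err:
        # The child gets its own copies of these handles, so closing ours is safe.
        proc = subprocess.Popen([sys.executable, "-c", code], stdout=out, stderr=err)
    return proc, out_path, err_path


# Toy "Run A": stands in for a long hyperparameter sweep (assumption).
RUN_A = "import sys; print('loss=0.25'); print('warning: toy run', file=sys.stderr)"

log_dir = tempfile.mkdtemp(prefix="runs_")
proc, out_path, err_path = launch_run("run_a", RUN_A, log_dir)

# The notebook stays free here; proc.poll() returns None while the run is going.
exit_code = proc.wait()  # block only when you actually need the result
stdout_text = out_path.read_text()
stderr_text = err_path.read_text()
```

The trade-off is the one the question already names: the run no longer shares the notebook's in-memory objects, so anything it needs or produces has to go through files.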
How I tie those threads together is by the data that they generate. I use ASDF because it works for the kind of stuff I'm doing, but choose your poison. Once the data are in the bag, the cells that analyze or report the results can stay in the same notebook, or be copied into your main notebook. My data aren't so huge that there's much of a penalty in re-loading them.
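As a rough illustration of tying runs together through their data, the sketch below writes each run's parameters and metrics to its own file and lets an analysis cell reload everything from disk. JSON from the standard library stands in for ASDF here purely to keep the example dependency-free; the helpers `save_run` and `load_runs`, the run names, and the numbers are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_run(results_dir, run_name, params, metrics):
    """Write one run's inputs and outputs to its own file and return the path."""
    path = Path(results_dir) / f"{run_name}.json"
    record = {"run": run_name, "params": params, "metrics": metrics}
    path.write_text(json.dumps(record, indent=2))
    return path


def load_runs(results_dir):
    """Load every saved run in the directory, keyed by run name."""
    runs = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        record = json.loads(path.read_text())
        runs[record["run"]] = record
    return runs


# Hypothetical runs with made-up numbers, purely for illustration.
results_dir = tempfile.mkdtemp(prefix="results_")
save_run(results_dir, "run_a", {"lr": 0.01}, {"loss": 0.5})
save_run(results_dir, "run_b", {"lr": 0.1}, {"loss": 0.25})

# An analysis cell, in this notebook or another, only needs the files.
runs = load_runs(results_dir)
best = min(runs.values(), key=lambda r: r["metrics"]["loss"])
```

Because the analysis side depends only on the files, the same cells can be pasted into a different notebook without carrying the training code along.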
For me, reproducibility is more important than organization, because I'm not all that organized anyway. So, a single master notebook at the end of a study isn't my top goal.