Hacker News
by kenward 1875 days ago
Slightly tangential, but has anyone figured out a good solution for version controlling Jupyter notebooks?

The closest thing that we've found has been to use the notebook percent format in a simple .py file [0][1]. It plays much more nicely with git than an .ipynb does, and it is still interactive enough for rapid prototyping. However, it would be nice to have some first-class support from Jupyter on this.

[0] https://jupytext.readthedocs.io/en/latest/formats.html?highl...

[1] https://code.visualstudio.com/docs/python/jupyter-support-py
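
For anyone who hasn't seen the percent format, here is a minimal sketch of what such a .py file can look like. The `# %%` and `# %% [markdown]` cell markers are the convention described in [0] and [1]; the sample data and the `summarize` helper are made up purely for illustration.

```python
# %% [markdown]
# # Scratch analysis
# Markdown cells are ordinary comments, so the file stays plain Python.

# %%
import statistics

# Hypothetical sample data, only for illustration.
samples = [2.0, 4.0, 4.0, 5.0]

# %%
def summarize(values):
    """Return the mean and population standard deviation of values."""
    return statistics.mean(values), statistics.pstdev(values)

# %%
mean, spread = summarize(samples)
print(f"mean={mean:.2f} spread={spread:.2f}")
```

Because the markers are just comments, the file also runs top to bottom with a plain `python` invocation, and a git diff shows only the lines of code that changed.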

10 comments

Until recently, I also thought this was a major problem. My old solution was to always pair .ipynb files with proper .py modules of the same name, so the .ipynb always starts with `%run foo.py` and just calls functions.

However, I recently started using VSCode Insiders, the preview release of VSCode, which has amazing support for Jupyter notebooks in the editor. You can use your normally configured linters, auto-formatters, and vi-mode keyboard shortcuts (a major selling point for me). You even get legible, cell-aware diffs when comparing in git. Now the edit/git workflow for .ipynb files is so close to parity with .py files that I've stopped caring whether code is in a proper Python module or not, and I almost never run Jupyter Notebook or Jupyter Lab in the browser.

In some ways this is a loss, because it's nice to have project-wide linting and other tooling that only works with .py files, but using `%run` was always an imperfect abstraction. By default, `%run` executes modules in a new module namespace which is then copied over, so it's not exactly "paste into Jupyter" unless you do `%run -i`. Even then, it's limited by running everything at once. Every cell in a typical Notebook is effectively a button that runs an `exec()` statement, and you can't achieve those semantics by calling Python functions.
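
To make the namespace difference concrete, here is a rough sketch in plain Python. It only approximates what IPython does: `runpy.run_path` stands in for plain `%run`, and an `exec()` into a shared dict stands in for `%run -i` and for cell execution. The file name `foo.py`, the variable names, and both helper functions are assumptions for illustration.

```python
import runpy
import tempfile
from pathlib import Path

# Hypothetical module body; it reads `scale`, which it never defines.
MODULE_SOURCE = "result = scale * 2\n"


def run_fresh(path):
    """Roughly like plain `%run`: execute in a new namespace, return its globals."""
    return runpy.run_path(str(path))


def run_interactive(path, user_ns):
    """Roughly like `%run -i`: execute directly inside the caller's namespace."""
    exec(compile(Path(path).read_text(), str(path), "exec"), user_ns)


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "foo.py"
    script.write_text(MODULE_SOURCE)

    user_ns = {"scale": 21}  # stands in for the notebook's namespace

    try:
        run_fresh(script)  # the fresh namespace cannot see `scale`
        fresh_failed = False
    except NameError:
        fresh_failed = True

    run_interactive(script, user_ns)  # sees `scale`, writes `result` back

# Cells behave like repeated exec() calls against one shared namespace,
# in whatever order the user clicks them.
cells = ["x = 1", "x = x + 1"]
cell_ns = {}
exec(cells[0], cell_ns)
exec(cells[1], cell_ns)
exec(cells[1], cell_ns)  # re-running a cell mutates the state again
```

The last three lines are the part a function call can't reproduce: each cell reads and rebinds names in the shared namespace, and running the same cell twice gives a different state.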

I've been using the VSCode Insiders release as well and have been loving it, along with the new native notebook features, for all the reasons you've already listed.

That's an interesting solution. I believe this is similar to what Joel Grus does [0], except %s/jupyter/ipython.

[0] https://www.youtube.com/watch?v=7jiPeIFXb6U

Forgive the snark, because my suggestion is obviously not an improvement or even a match for the target audience, but org-mode files with inline org-babel code blocks are what I consider to be the perfect version-controlled notebook.

Pity Emacs is not the best on-boarding experience.

Worth noting that org files can also embed images inline. Unfortunately, I don't think it's enough to attract new users: images are non-interactive, you can't embed any other media, you have to pop open a separate window to edit code, you can't embed rendered markdown (i.e. headers and paragraphs have mostly the same font style/size), and so on. Sure, org, babel, and calc give you a lot of other things to like, but that's if you're already an Emacs user.

Here [0] is a guide that explains syncing .ipynb <> .py files with Jupytext. I also add the .ipynb files to `.gitignore`. It works well, although the file browser in Jupyter becomes cluttered because every notebook file is doubled. It'd be great to hide the underlying .py files.

[0] https://github.com/mwouts/jupytext/blob/master/docs/paired-n...

So far, we try to just use Jupyter notebooks for experimentation, scratch, and poking around.

Real, working code gets checked in to a repository. The only reason to go back to an old notebook after that point is maybe to see how you may have experimented with or poked at some data.

Shameless plug, but with Nextjournal you can use git directly: https://github.nextjournal.com/

And apart from that, we have a normal github component to load your code from a github repository: https://nextjournal.com/help/github

Could you share some of your experience with nbdev? I'm a huge fan of what the fastai team has been doing and I've tried nbdev, but I haven't been convinced yet. Particularly with the pull request experience, it's not very easy to do code reviews.

FWIW my team uses bitbucket and the PR experience is significantly worse than github/gitlab unfortunately.

I just keep them as .ipynb files and then use git's clean and smudge filters together with nbconvert's "clear notebook output" preprocessing, ensuring only clean notebooks get added/diff'd/committed.
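
As a rough sketch of what the clean side of such a filter does, here is a stdlib-only stand-in that clears outputs from the notebook JSON. This is not nbconvert's implementation; it assumes the nbformat 4 layout, where each code cell carries an `outputs` list and an `execution_count`. The function names and the two-cell demo notebook are hypothetical.

```python
import io
import json


def strip_outputs(notebook):
    """Return the notebook dict with code-cell outputs and counters cleared."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clean(src, dst):
    """Read a notebook from src and write the stripped version to dst.

    A git clean filter receives the working-tree file on stdin and must
    print the version to be staged on stdout, so a real filter would
    call this with sys.stdin and sys.stdout.
    """
    notebook = json.load(src)
    json.dump(strip_outputs(notebook), dst, indent=1, ensure_ascii=False)
    dst.write("\n")


# Hypothetical two-cell notebook used only to exercise the filter.
demo = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["1 + 1"],
            "execution_count": 7,
            "outputs": [
                {
                    "output_type": "execute_result",
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                    "execution_count": 7,
                }
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

out = io.StringIO()
clean(io.StringIO(json.dumps(demo)), out)
cleaned = json.loads(out.getvalue())
```

To wire something like this up, git expects a `filter.<name>.clean` command in the git config and a matching `*.ipynb filter=<name>` line in `.gitattributes`; the filter name is yours to choose.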

I think jupytext is already as close to "first-class" support as you're going to get. Personally, I'd be happy to see the project included with Jupyter, but you'll have to pester the Jupyter devs for that ;-)

I would love to see that as well. I'm wondering what has stopped them from integrating it already... Maybe there's room for some contributions from the community here :)

You could try a preprocessing step of converting them to Markdown with Pandoc or nbconvert. In Git, you can configure custom diff tools for certain file formats.

You can configure them so they don't save their output; then the .ipynb diffs will be readable.

Makes me wonder if you could split an ipynb into an in and out file, then add /*.out to .gitignore.
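
A quick sketch of how that split could work, assuming the nbformat 4 layout (code cells with `outputs` and `execution_count`). The in/out naming is just the idea above, not an existing tool, and every function and variable name here is hypothetical.

```python
import copy


def split_notebook(notebook):
    """Return (source_only_notebook, outputs), where outputs lines up with cells.

    Non-code cells get a None placeholder so both halves stay JSON-friendly
    and can be written to separate files.
    """
    source = copy.deepcopy(notebook)
    outputs = []
    for cell in source.get("cells", []):
        if cell.get("cell_type") == "code":
            outputs.append(
                {
                    "outputs": cell.get("outputs", []),
                    "execution_count": cell.get("execution_count"),
                }
            )
            cell["outputs"] = []
            cell["execution_count"] = None
        else:
            outputs.append(None)
    return source, outputs


def merge_notebook(source, outputs):
    """Rebuild a full notebook from the two halves.

    Pairing is by cell position, so it breaks if cells are reordered in
    the source half without regenerating the outputs half.
    """
    merged = copy.deepcopy(source)
    for cell, saved in zip(merged.get("cells", []), outputs):
        if saved is not None:
            cell.update(saved)
    return merged


# Hypothetical notebook used only to check the round trip.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["notes"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["print('hi')"],
            "execution_count": 3,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

in_part, out_part = split_notebook(notebook)
round_trip = merge_notebook(in_part, out_part)
```

Only the source half would be committed; the outputs half is the artifact that goes in `.gitignore`.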

I believe I achieved the no-saving-output effect by adding a Python snippet/plugin to Jupyter Lab. So you could program it to do whatever you want. That's what I love about Jupyter Lab: you can turn it into whatever kind of environment you want.

I like this idea, but it seems a little backwards. Normally you commit the _source_ and omit the _artifacts_ haha.

That's what they are saying? Doesn't seem backwards to me.

Oh, you may be right. I interpreted it as having a separate build step to generate the *.out files.