Hacker News
by dahart 1495 days ago
Can you talk more about why you’re working in Jupyter Notebooks at a level that needs diff reviews? Are you reviewing your own work, or the work of others?

One option would be to start a policy to always “restart and clear output” before saving. This empties the output cells and makes the .ipynb files diffable. It also happens to make them nicer to store in version control.
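If you want to enforce that policy without relying on everyone remembering the menu item, here is a minimal sketch of the same cleanup done from a script. It assumes the usual nbformat 4 JSON layout (a top-level `cells` list where code cells carry `outputs` and `execution_count`); the function names are made up for illustration.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    # Deep copy via a JSON round trip so the caller's dict is left untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []             # drop rendered results, plots, logs
            cell["execution_count"] = None   # drop the In[n] counter
    return cleaned


def clear_outputs_in_file(path: str) -> None:
    """Rewrite the .ipynb file at `path` in place with outputs cleared."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(clear_outputs(notebook), f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Something like `clear_outputs_in_file("analysis.ipynb")` could then run from a pre-commit hook. Note that this only strips the stored outputs; it does not restart any kernel.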

Another option would be to work in pure Python files in the first place, and only use Jupyter after the fact. The close brother to Jupyter is the Spyder IDE, which gives you most of the benefits of quick visual outputs, but also has a nice Python debugger built in.


Edit: Googling reveals nbdime. Has this been looked into? https://nbdime.readthedocs.io/en/latest/

Not OP, but I can easily imagine the need for what he's asking.

These days you'll find a lot of data and image processing algorithms offered as notebooks. Say you make some changes to the provided code, and after a handful of them something stops working right. You might want to diff where you are now against a working version, in the hope that the differences clue you in to where the problem might be.

As an aside, I want to say Jupyter notebooks (more so JupyterLab) are sort of a disruptive change to our coding workflows. We've had interpreters for a long time, sure, but creating interactive graphs on the fly is a godsend; insights come to you in such a workflow that wouldn't otherwise. I hope this catches on. I actually want my shell terminal to become more Jupyter-like. Also, fun fact: did you know you can do real-time collaboration on Jupyter notebooks? https://jupyterlab.readthedocs.io/en/stable/user/rtc.html

Oh I can totally imagine use-cases too, but I’d love to hear what the OP’s use case actually is. I also agree completely on the disruption that Jupyter brings, and that it has just massive benefits. But when a workflow isn’t giving you everything you want, it’s worth evaluating whether the tools you’re using are the right tools for the job, right?

One example would be that Jupyter is well designed for a lot of prototyping and for single-person scenarios. It’s well designed for sharing and for including notes and narrative with code. It’s just not really designed for multi-user workflows. That’s not a negative in my book, it’s just a fact that makes me reach for a different tool when I need to collaborate.

Also don’t overlook Spyder, which is part of the same ecosystem as Jupyter (they’re usually bundled together). Spyder gives you the interactive features you want but might better support a production workflow that is multi-user, collaborative, and more easily diffable.

All that said, it might be awesome if someone builds a Jupyter diff tool that is designed to ignore the output cells!
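For what it's worth, the core of such a tool is small. Here is a rough sketch, not a replacement for something like nbdime: it pulls only the code-cell sources out of two notebook dicts and runs them through the standard library's `difflib`. It assumes the nbformat 4 layout, where `source` is either a string or a list of strings; `code_sources` and `diff_code` are hypothetical names.

```python
from __future__ import annotations

import difflib


def code_sources(notebook: dict) -> list[str]:
    """Flatten the code cells of a notebook dict into a list of source lines."""
    lines = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat stores source as str or list of str
            source = "".join(source)
        lines.append(f"# --- cell {index} ---")  # marker so cell boundaries show up in the diff
        lines.extend(source.splitlines())
    return lines


def diff_code(old: dict, new: dict) -> list[str]:
    """Unified diff of code-cell sources only; outputs are never looked at."""
    return list(
        difflib.unified_diff(
            code_sources(old),
            code_sources(new),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
```

Load the two versions with `json.load` and print `"\n".join(diff_code(old, new))`. Two notebooks that differ only in their outputs produce an empty diff.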

Hey there - OP here. I haven't used Spyder; I'll have to check it out.

The primary use case is: I am a researcher in NLP, where speed of prototyping is key. I work in an environment where research fragments are primarily Jupyter notebooks. So diffing notebooks typically means reviewing my own changes when modifying my own and others' research sketches, since it's helpful to see how the code changes.

What really resonates with me is what others have said: I need to run cells that take 2-6 hours to compute, so recomputing cells is annoying... I don't love notebooks for their messy state, which causes obvious and very annoying problems, and I am not an advocate for notebooks in production for this reason. But the flexibility of computing stuff, having that persist, and doing downstream prototyping makes notebooks amazing! Markdown and LaTeX in there are also really helpful.

The secondary use case is PRs, but... typically reviewing others' research code isn't at the granular level of notebook diffs across a few commits, so it doesn't come up often.

> https://jupyterlab.readthedocs.io/en/stable/user/rtc.html

Wow! Real-time collaborative notebook editing! This is going to be so cool for teaching (allow students to fill in part of the code block).

Have you tried this yet? Is the idea to run Jupyter on a machine with a public IP and port 8888 open, allowing the server to be accessed by multiple people at the same time? Would this work with services like `ngrok` that make your personal computer available online?

Not OP, but restart and clear output can be quite compute intensive if you're working with big datasets or training ML models. There are many ways to mitigate this, like saving weights and only redoing the inference, but it's not always worth it when you're iterating through models and parameters or doing exploratory data analysis. Most of the time you just want to keep the results/outputs of the previous run and improve from there.

That’s a great point, I sometimes avoid clearing outputs when I’m playing with PyTorch just because retraining takes a while. This has been motivating me to learn how to be fluent with saving weights to disk.
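
On saving results to disk: below is a minimal sketch of the general idea using only the standard library's `pickle`, so a slow cell runs once and later runs load the stored result. `cached` and the file name in the usage note are hypothetical. For PyTorch weights specifically, the usual route is the library's own `torch.save(model.state_dict(), path)` rather than this helper.

```python
import os
import pickle


def cached(path, compute):
    """Load a pickled result from `path` if it exists, else compute and save it.

    `compute` is any zero-argument callable wrapping the slow step.
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # only unpickle files you wrote yourself
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

In a notebook this might look like `features = cached("features.pkl", lambda: slow_feature_extraction(corpus))`, where `slow_feature_extraction` stands in for whatever takes hours. The cache does not notice when the code or inputs change, so delete the file to force a recompute.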