

by PaulHoule 2141 days ago

How do you deal with differential versioning of code and data, and the fact that people don't always execute notebooks from top to bottom?

For instance, suppose I have a notebook that takes 2 hours to generate a model. From the viewpoint of explaining it I'd like to make a notebook where I start from the beginning, train the model, then use it.

If I want to show it to people I want to save all the results and re-render them, not rerun the calculation, certainly if I want to show off the results in a 1 hour talk!
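To make the "save the results, don't rerun" idea concrete, here is a minimal sketch of caching an expensive step on disk, keyed by its parameters. The names `cached`, `cache_dir`, and `compute` are made up for illustration, and it assumes the result can be pickled and the parameters are JSON-serializable.

```python
import hashlib
import json
import pickle
from pathlib import Path


def cached(cache_dir, params, compute):
    """Return compute(params), reusing a pickled result if one exists for these params."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Key the cache on the parameters, so changing them forces a recompute.
    key = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()[:16]
    path = cache_dir / f"{key}.pkl"

    if path.exists():
        # Fast path: re-render from the saved result (e.g. during a talk).
        with path.open("rb") as f:
            return pickle.load(f)

    # Slow path: run the real calculation once and save it.
    result = compute(params)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Deleting the cache directory gives you the honest top-to-bottom run again. Note that this key ignores changes to the code and the input data, which is exactly the versioning gap being discussed.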

From the viewpoint of reproducibility, however, you have to be able to run the notebook from top to bottom and get a 'correct' result. I'm not going to say the 'same' result because many calculations are stochastic in nature (e.g. random numbers) or because often the data changes. (Let's say I have somebody make a notebook that does April's sales reports -- shouldn't I just be able to point it at the May data to make May's sales reports?)
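A rough sketch of what I mean, assuming the report is a function of the data and the month rather than a notebook with "April" baked in. The function name, the `(month, amount)` row layout, and the bootstrap interval are all invented for the example; the point is only that the month is a parameter and the randomness comes from an explicit seed.

```python
import random
from statistics import mean


def sales_report(rows, month, seed=0, n_boot=200):
    """Summarise one month's sales; `rows` is a list of (month, amount) tuples."""
    amounts = [amount for m, amount in rows if m == month]
    if not amounts:
        raise ValueError(f"no rows for {month!r}")

    # A local, seeded RNG: the stochastic part repeats exactly for a given seed.
    rng = random.Random(seed)
    boot_means = sorted(
        mean(rng.choices(amounts, k=len(amounts))) for _ in range(n_boot)
    )

    return {
        "month": month,
        "total": sum(amounts),
        "mean": mean(amounts),
        # Rough 95% bootstrap interval for the mean.
        "mean_low": boot_means[int(0.025 * n_boot)],
        "mean_high": boot_means[int(0.975 * n_boot)],
    }
```

Calling it with `"2019-05"` instead of `"2019-04"` is then the whole job of making May's report, and two runs with the same seed and data agree.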

Between the long time delays for the system to settle down (longer than people can hold a context in their mind, longer than they want to wait) and the total complexity, I find that many people involved with data science violently resist confronting the above issues. The effects are much like the visual "blind spot" -- you might get a series of projects that were 98% completed but didn't quite deliver business value although everybody feels like they did their part.

Like other vendors in this crowded space, dstack leads with technology as the key issue, e.g. "supports Python and R", "matplotlib, TensorFlow, plotly, ..."

It's certainly true that people don't want to face up to reality in that area. Maybe 50% or 90% of the "waste" in the area involves setting your dependencies up, begging your boss to get you access to "the cloud of your choice if that's what's needed". The trouble is that investment in particular technologies is of temporary value (maybe people will still be using R in 2030, maybe they won't be using TensorFlow, almost certainly plotly gets bought by Google and shut down by then).

Years back I researched the problem of running TensorFlow models that we got off the pavement, building a database that says TF version X depends on CUDA version Y, cuDNN version Z, and being able to have multiple copies of the userspace GPU drivers installed simultaneously (e.g. just put 'em in a directory and set the library path to point at 'em -- don't even need containers!)
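Roughly, the approach looks like the sketch below. This is not the original code: the table entries, the `/opt/gpu-libs` layout, and the `lib64` subdirectory are placeholders, and it assumes Linux, where the dynamic loader reads `LD_LIBRARY_PATH`.

```python
import os

# Hypothetical compatibility table. The version strings are placeholders
# (X, Y, Z as above), not real compatibility data.
COMPAT = {
    "tf-X": {"cuda": "cuda-Y", "cudnn": "cudnn-Z"},
}


def library_env(tf_version, root="/opt/gpu-libs", base_env=None):
    """Build an environment whose library path points at the matching GPU libraries."""
    env = dict(os.environ if base_env is None else base_env)
    deps = COMPAT[tf_version]

    # Each library version lives in its own directory under `root` (assumed layout).
    dirs = [
        os.path.join(root, deps["cuda"], "lib64"),
        os.path.join(root, deps["cudnn"], "lib64"),
    ]

    # Put our directories first so they win over system-wide copies.
    existing = env.get("LD_LIBRARY_PATH")
    env["LD_LIBRARY_PATH"] = os.pathsep.join(dirs + ([existing] if existing else []))
    return env
```

You would then start the model in a child process with something like `subprocess.run(cmd, env=library_env("tf-X"))`, so each model gets the library versions it was built against without touching the rest of the machine.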

I could have sworn Google looked at my source because they did the one thing that could have broken that strategy. Also the company I was working for lost interest in that particular shiny thing. That's a basic problem with maintaining a distribution of other people's software -- like treading water, it takes effort just to stay in one place.

The more fundamental problems that turn up in going from data to decision and products are eternal and not tied to a particular technology. If you solve those problems rather than chase the shiny you might break out of the pack.

1 comment

peterschmidt 2141 days ago

I agree with your point. Reproducibility and versioning are important yet very challenging topics right now, and not many tools seem to help with them. And it might be that the problem is not specifically about tools but rather about mindsets and workflows.

IMO dstack is a lot about process. Technologies can change. The process often stays. We’d like to find the best way to solve the problems people face every day, regardless of any particular technology.

One more little thing which might be relevant is that dstack actually tracks revisions. What we haven't figured out yet is how to link a particular revision of the application with a particular revision of the code / notebook.
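One possible direction, purely as a sketch and not something dstack does today: store a small manifest next to each published result that records a content hash of both the result and the code or notebook that produced it. The function names and the manifest layout here are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's bytes, used as a content-based revision id."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_manifest(artifact_path, source_path, manifest_path):
    """Record which source revision produced which artifact revision."""
    manifest = {
        "artifact": {"path": str(artifact_path), "sha256": file_digest(artifact_path)},
        "source": {"path": str(source_path), "sha256": file_digest(source_path)},
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def source_matches(manifest_path, source_path):
    """True if the source file is still the revision recorded in the manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    return manifest["source"]["sha256"] == file_digest(source_path)
```

A version-control commit id could replace the source hash where the notebook lives in a repository; the hash approach just avoids assuming one.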
