by Jugurtha 2101 days ago
- What's causing the most trouble for the users in terms of frequency and severity?

- Which one could have the biggest impact, and how much work would it take to implement (hard/long)?

- Which one, if done, will unlock a graph of possibilities for features to come, or make other issues irrelevant?

- Are these issues materializations of the same underlying problem?

Example: our users were constantly competing for resources (GPU, RAM, etc.) to train their models with notebooks, in a tragedy of the commons way. Notebook users are familiar with "OOM" errors. You start training using a notebook, someone else does the same, bam!

The thing to do was clear: schedule notebooks[0]. It doesn't have to be fair queuing, even FIFO does the job. Just put the notebooks in a queue and execute them sequentially. No priority.
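To make the FIFO idea concrete, here is a minimal sketch in Python. It is not our actual implementation: `run_notebook` is a hypothetical stand-in for whatever really executes a notebook, and the file names are made up. A single worker pulls from a standard `queue.Queue`, so only one notebook holds the GPU/RAM at a time.

```python
import queue
import threading

def run_notebook(path):
    # Hypothetical stand-in for real notebook execution.
    return f"executed {path}"

def worker(jobs, results):
    # Take notebooks one at a time, in arrival order, so two training
    # runs never compete for the same resources.
    while True:
        path = jobs.get()
        if path is None:  # sentinel: no more work
            jobs.task_done()
            break
        results.append(run_notebook(path))
        jobs.task_done()

jobs = queue.Queue()  # FIFO by default, no priorities
results = []
thread = threading.Thread(target=worker, args=(jobs, results))
thread.start()

# Submitting returns immediately; execution happens later, in order.
for path in ["alice.ipynb", "bob.ipynb", "carol.ipynb"]:
    jobs.put(path)
jobs.put(None)
thread.join()
```

With one worker the notebooks run strictly one after another, which is all FIFO needs. Fair queuing or priorities could be added later without changing how users submit.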

The issues might have said "Out of Memory" errors, and a naive approach would be to simply add resources, but is that really the underlying problem? No.

Once you have notebooks running asynchronously, it opens a whole new graph of possibilities and makes it easier to do things properly, like leveraging a proven workload manager. The limiting step was to decouple the intent to execute from the execution itself, and wrap that intent in a request. Now you can re-run it, delegate it to a more powerful node, etc.
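A rough sketch of what "wrap the intent in a request" can look like, again in Python. Everything here is hypothetical and for illustration only: the names `RunRequest`, `RunResult`, `execute`, the node names, and the in-memory `store` are assumptions, not our real API.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class RunRequest:
    # The intent to execute: which notebook, with which parameters.
    notebook_path: str
    parameters: dict = field(default_factory=dict)
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)

@dataclass
class RunResult:
    request_id: str
    node: str
    output: str

def execute(request, node="default-node"):
    # Hypothetical executor; a real one would run the notebook on `node`.
    output = f"ran {request.notebook_path} with {request.parameters}"
    return RunResult(request.request_id, node, output)

# Results are kept by request id, independent of any browser session.
store = {}

req = RunRequest("train.ipynb", {"epochs": 10})
store[req.request_id] = execute(req)

# Because the request is just data, it can be re-run or delegated
# to a more powerful node without the user doing anything again.
store[req.request_id] = execute(req, node="gpu-node")
```

The point of the design is that the request outlives the session that created it, so re-running, delegating, and fetching results later all become operations on a plain record.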

One other issue it solves is something notebook users are familiar with, too: you can't access the results of a computation if you close your browser/tab/computer. If the connection between the kernel and the front end is lost in Jupyter notebooks, the computation continues, but the results don't make it to the front end. You could have a training job that takes hours; if the connection is lost, so is your result. Notebook users work around this by saving their model to disk, but they aren't using notebooks just for that. They want graphs and visualizations in the notebook too, or else they would have just used a script. Saving to disk is possible, but we do automatic model/parameter logging so our users don't have to bother with that, and they can always click on "Deploy" to deploy their model.

Focusing on scheduling notebooks solves that. Now even if users closed their laptops, or the connection was lost, the result of the notebook would still be there.

Then we went further and displayed the results of the run outside of Jupyter, so now you can show a client the result of the work and discuss it.

So that was an example of handling an issue that "fixes" a problem like crashes, and indirectly unlocks possibilities.

Now an example of handling issues/tickets by implementing a "feature" where there are no crashes: users worked in tandem, and sometimes asked for assistance, and we helped them troubleshoot their code. One pattern was that they shared "screenshots" of the code. We already had sharing functionality with which they could simply share their notebook with another user, and that user could edit it.

We added near real-time collaboration[1] to notebooks so several people could work together on the same notebook, see what everyone else was editing, and follow each other's cursors while doing so. Being able to work on the same document clearly simplified things for users who were not proficient with Git, which is also why we added multiple checkpoints[2], as the default was one checkpoint.

So, a question to ask is "should we do this?", and then look at the tickets/issues that have high impact. Some of them will be harder to implement than others, and you have to balance your allocation, but at least you'll do it on something that matters to users.

[0]: https://iko.ai/docs/notebook/#long-running-notebooks

[1]: https://iko.ai/docs/notebook/#collaboration

[2]: https://iko.ai/docs/notebook/#multiple-checkpoints

1 comments

what are you talking about?
OP's question was about how to manage a project and how to decide the order in which to tackle interdependent issues.

The reply asked whether the issues needed to be resolved at all, gave a set of questions to ask to find out what to work on and what not to work on, and gave examples from an actual real-world project in both situations [fix a problem / add a feature].

In other words: how to manage a project and how to decide what to do and what not to do.

Which part is confusing?

I got it. Thank you.
My pleasure.