by p1esk 1988 days ago
I do something very similar (five 8xGPU servers, PyCharm, ssh, tmux), and I have no solution to the issues you described. I manually launch one ssh/tmux session per server, typically with multiple tmux panes showing nvidia-smi and htop output, and I keep reconnecting to these sessions to monitor progress. I also save the results of each experiment to text files on shared storage, so that at the end of a hyperparameter search I can just look at those files. That is sometimes easier and quicker than looking through tmux sessions.
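For the "look at the result files at the end" step, a small script can do the scanning. This is only a rough sketch under assumptions that are not part of my actual setup: it assumes each experiment writes one `.txt` file into a shared directory and reports its metric on a line like `val_acc: 0.913`. The directory name, the metric name, and the `collect_results` helper are all hypothetical.

```python
import re
import sys
from pathlib import Path

# Hypothetical format: a result file contains one or more lines such as
# "val_acc: 0.913". Adjust the pattern to whatever your runs really write.
METRIC_RE = re.compile(r"^val_acc:[ \t]*([0-9]*\.?[0-9]+)[ \t]*$", re.MULTILINE)


def collect_results(results_dir):
    """Return (metric, filename) pairs sorted best first.

    Files without a metric line (e.g. runs still in progress) are skipped.
    """
    rows = []
    for path in sorted(Path(results_dir).glob("*.txt")):
        matches = METRIC_RE.findall(path.read_text())
        if matches:
            # Use the last reported value in the file.
            rows.append((float(matches[-1]), path.name))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


if __name__ == "__main__":
    # Assumed default directory name; pass the real shared path as an argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "results"
    for metric, name in collect_results(target):
        print(f"{metric:.4f}  {name}")
```

Run against the shared storage path, it prints one line per finished experiment, which beats opening the files one by one, but it does nothing for the monitoring side of the problem.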

I've seen plenty of experiment management tools advertised, but every time I looked at one it was either very limited or required significant restructuring of my code or my workflow.

I'd like to hear about whatever solution you find because I agree, this does get tedious and painful sometimes.