by p1esk
1988 days ago
I do something very similar (five 8xGPU servers, PyCharm, ssh, tmux), and I have no solution to the issues you described. I manually launch one ssh/tmux session per server, typically with multiple tmux panes showing nvidia-smi and htop output, and I keep reconnecting to those sessions to monitor progress. I also save the results of each experiment to text files on shared storage, so that at the end of a hyperparameter search I can just look at those files, which is sometimes easier and quicker than looking through tmux sessions. I've seen plenty of experiment management tools advertised, but every time I looked at one it was either very limited or required significant restructuring of my code or my workflow. I'd like to hear about whatever solution you find because I agree, this does get tedious and painful sometimes.
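
For what it's worth, the "look at the result files at the end" step can be scripted. This is only a rough sketch, not what I actually run: it assumes each run writes one small JSON file into a shared directory, and the function names and the `val_acc` key are made up for illustration.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Read every *.json file in results_dir into a list of dicts.

    Assumption: one JSON object per run, e.g. {"lr": 0.01, "val_acc": 0.9}.
    """
    runs = []
    for path in sorted(Path(results_dir).glob("*.json")):
        try:
            record = json.loads(path.read_text())
        except (OSError, ValueError):
            # Unreadable or half-written file (run still in progress): skip it.
            continue
        if not isinstance(record, dict):
            continue  # not the one-object-per-run layout assumed above
        record["file"] = path.name  # remember which file the run came from
        runs.append(record)
    return runs


def best_runs(runs, metric="val_acc", top=5):
    """Return up to `top` runs with the highest value of `metric`."""
    scored = [r for r in runs if isinstance(r.get(metric), (int, float))]
    return sorted(scored, key=lambda r: r[metric], reverse=True)[:top]
```

Calling `best_runs(load_results(some_dir))` on the shared results directory then gives the top few configurations without opening any tmux session. Runs that are missing the metric are simply left out of the ranking.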