by lazylizard
100 days ago
I am not an SRE, merely a sysadmin, and somehow I have the impression that GPUs on Slurm/PBS could not be simpler. You can use a VM for the head node, and you don't even really need the clustering, if you can accept taking 20 min to restore a VM. The rest of the hardware is homogeneous: you set up one node right and the rest are identical. And it's a cluster with a job queue, so one node going down is not the end of the world. OK, if you have PCIe GPUs, sometimes you have to re-seat them and it's a pain. Otherwise, if your H200 or disks fail, you just replace them, under warranty or not...