by godelski 642 days ago
As an ML person who's also worked on HPC stuff: you will almost certainly save money by doing this, and there are plenty of other benefits. It is generally a good idea, but the barrier to entry is a bit higher and you need in-house expertise.

So, one important piece of advice: if you can, hire an admin with HPC experience. If you can't, find ML people with HPC experience. Things you can ask about are Slurm, environment modules (this is a clear sign!), what a flash buffer is, ZFS, what they know about PyTorch DDP, their Linux experience, whether they've built a cluster before, whether they've adminned Linux, and so on. If you need a test, ask them to write a simple bash script that runs some task, then see whether everything is in functions and whether they know how to set variable defaults. These people won't know everything, but they'll be able to pick up the slack and will probably enjoy it, as long as you have more than one of them. Adminning is a shitty job, so if you only have one they'll hate their life.
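
For reference, here is a minimal sketch of the kind of script that test is looking for: work split into functions, and every setting given a default with `${VAR:-default}`. The names (`TASK_NAME`, `N_STEPS`, `OUT_DIR`, `run_task`) and the toy task itself are made up for illustration; a real answer would wrap whatever job you asked for.

```shell
#!/usr/bin/env bash
set -eu

# Variable defaults: keep the caller's value if set and non-empty, else fall back.
TASK_NAME="${TASK_NAME:-demo}"
N_STEPS="${N_STEPS:-3}"
OUT_DIR="${OUT_DIR:-${TMPDIR:-/tmp}/${TASK_NAME}-out}"

log() {
    # Messages go to stderr so stdout stays clean for real output.
    printf '[%s] %s\n' "$TASK_NAME" "$*" >&2
}

run_task() {
    # Positional argument with a default: run_task [steps]
    local steps="${1:-$N_STEPS}"
    local i=1
    while [ "$i" -le "$steps" ]; do
        echo "step $i" >> "$OUT_DIR/result.txt"
        i=$((i + 1))
    done
    log "wrote $steps steps to $OUT_DIR/result.txt"
}

main() {
    mkdir -p "$OUT_DIR"
    : > "$OUT_DIR/result.txt"   # start from an empty result file
    run_task "$@"
}

main "$@"
```

Saved as, say, `task.sh` (a hypothetical filename), it can be run with no arguments or overridden from the environment, e.g. `N_STEPS=5 OUT_DIR=./out bash task.sh`.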

There are plenty of ML people who have this experience[0], and you'll really reap the rewards of having a few people with even a bit of this knowledge. Without it, it is easy to buy the wrong things or have your system run far from efficiently, and end up with frustrated engineers/researchers. Even with only a handful of people running experiments, schedulers (like Slurm) still have huge benefits. You can do more complicated sweeps than with wandb, batch submit jobs, track usage, allocate usage, easily cut up your nodes or even a single machine into {dev,prod,train,etc} spaces, and much more. Most importantly, a scheduler (Slurm) will help prevent your admin from quitting, as it'll keep them from going into a spiral of frustration.
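
To make the sweep / batch-submit point concrete, here is a rough sketch of a Slurm job array, where one submission fans out into one task per hyperparameter combination. The grid, the job name, and `train.py` are placeholders I made up; the `#SBATCH` options and `SLURM_ARRAY_TASK_ID` are standard Slurm, but resource flags such as `--gres` depend on how your cluster is configured.

```shell
#!/usr/bin/env bash
# Six array tasks (indices 0-5), one GPU and two hours each.
# In the output name, %x is the job name, %A the array job ID, %a the task index.
#SBATCH --job-name=lr-sweep
#SBATCH --array=0-5
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --output=%x_%A_%a.out
set -eu

# Hypothetical sweep grid: 3 learning rates x 2 seeds = 6 combinations.
LRS="1e-4 3e-4 1e-3"
SEEDS="0 1"
N_SEEDS=2

pick() {
    # Print the word at 0-based index $1 among the remaining arguments.
    shift "$(( $1 + 1 ))"
    printf '%s\n' "$1"
}

sweep_args() {
    # Map a flat array index to one (lr, seed) combination.
    local task_id="$1"
    local lr seed
    lr="$(pick "$(( task_id / N_SEEDS ))" $LRS)"
    seed="$(pick "$(( task_id % N_SEEDS ))" $SEEDS)"
    printf '%s\n' "--lr $lr --seed $seed"
}

# Slurm sets SLURM_ARRAY_TASK_ID per array task; default to 0 for a local dry run.
TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"

# train.py is a placeholder, so the command is echoed here rather than executed.
echo "python train.py $(sweep_args "$TASK_ID")"
```

Submitted with `sbatch sweep.sh` (hypothetical filename), the scheduler queues all six tasks and runs them as GPUs free up. Run directly outside Slurm, the script just prints the command for index 0: `python train.py --lr 1e-4 --seed 0`.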

[0] At least in my experience these tend to be higher quality ML people too, but not always. I think we can infer why there would be a correlation (details).


Nice ideas, but we have chosen a really simple Kubernetes deployment. We only install the host OS (Ubuntu Server) and then join the self-hosted GPU machines as workers in a Kubernetes cluster.

No other setup is needed, and our Grafana monitors whether the server (and its containers) are up and running.

Sorry, my suggestion was for when you need to do training. If you're only serving, then the suggestions I made aren't as valuable and something like what you've done probably makes more sense. But you want a proper cluster setup to do multi-GPU and especially multi-node stuff.
> "Would you mind sharing the name of the data center?"

Curious to know what you use other than Grafana in your monitoring stack. We use Prometheus for metrics/alerts and Loki/Promtail for logs.