Ask HN: How do AI labs set up their infrastructure to train large models?
3 points
by true2octave
1101 days ago
At my company I have to do this task. So far I have seen Slurm-based cluster setups (V100s or H100s), some fast distributed file system, Docker for containers, and PyTorch with the DDP strategy. But I read somewhere that Kubernetes can also be used, and that there's Singularity as a Docker alternative. Where can I learn more about this?
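For context, here is a rough sketch of how the Slurm and DDP pieces usually connect, as far as I understand it. Slurm sets `SLURM_PROCID`, `SLURM_NTASKS` and `SLURM_LOCALID` for each task, and PyTorch's default `env://` initialization reads `RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT`. The helper name `ddp_env_from_slurm`, the fallback defaults, and the assumption that the launch script already exports `MASTER_ADDR` are all mine, not anything a particular lab does.

```python
import os


def ddp_env_from_slurm(env=None):
    """Translate Slurm per-task variables into the names torch.distributed expects.

    Hypothetical helper for illustration. Assumes one Slurm task per GPU and
    that MASTER_ADDR was exported by the batch script (for example, the first
    hostname of the allocation). Falls back to single-process values when the
    Slurm variables are absent, so it also works on a laptop.
    """
    env = os.environ if env is None else env
    return {
        "RANK": env.get("SLURM_PROCID", "0"),         # global rank of this task
        "WORLD_SIZE": env.get("SLURM_NTASKS", "1"),   # total number of tasks
        "LOCAL_RANK": env.get("SLURM_LOCALID", "0"),  # rank within this node
        "MASTER_ADDR": env.get("MASTER_ADDR", "127.0.0.1"),
        "MASTER_PORT": env.get("MASTER_PORT", "29500"),  # arbitrary free port
    }


if __name__ == "__main__":
    settings = ddp_env_from_slurm()
    os.environ.update(settings)
    print(settings)
    # In a real training script the next steps would be roughly:
    #   torch.distributed.init_process_group(backend="nccl")
    #   model = torch.nn.parallel.DistributedDataParallel(
    #       model, device_ids=[int(settings["LOCAL_RANK"])])
```

Something like this would run once per task when the job is launched with `srun`, inside whichever container runtime the cluster uses.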