Hacker News
by frogblast 1887 days ago
Are there any good resources that describe how, in practice, existing training workloads are distributed among GPUs (using TensorFlow, PyTorch, or anything else)?

I'm curious how the problem effectively gets sliced.

1 comment

The state of the art for the biggest language models (which is effectively where the largest models are) is described here: https://www.microsoft.com/en-us/research/blog/zero-infinity-...
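
For a feel of the simplest way the problem gets sliced, here is a rough sketch of plain data parallelism in pure Python. It is only an illustration, not how any particular framework implements it: the "GPUs" are simulated as list slices, the model is a toy y = w * x, and every name in it (shard, local_gradient, all_reduce_mean, data_parallel_step) is made up for the example. The ZeRO work linked above goes further and also partitions the model state itself (optimizer states, gradients, parameters) across devices, which this sketch does not attempt.

```python
# Toy sketch of data parallelism: each simulated "GPU" gets one slice of the batch.
# All function names are hypothetical; no real framework API is used here.

def shard(batch, num_workers):
    """Split a batch into num_workers equal contiguous slices."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must be divisible by num_workers")
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]


def local_gradient(w, examples):
    """Gradient of mean squared error for the model y = w * x on one slice."""
    return sum(2 * (w * x - y) * x for x, y in examples) / len(examples)


def all_reduce_mean(values):
    """Stand-in for an all-reduce collective: average the per-worker values."""
    return sum(values) / len(values)


def data_parallel_step(w, batch, num_workers, lr):
    """One SGD step: local gradients per slice, averaged, then applied once."""
    grads = [local_gradient(w, s) for s in shard(batch, num_workers)]
    return w - lr * all_reduce_mean(grads)


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_parallel = data_parallel_step(0.0, batch, num_workers=2, lr=0.01)
w_single = data_parallel_step(0.0, batch, num_workers=1, lr=0.01)
```

With equal-sized slices, averaging the per-worker gradients gives the same update as computing the gradient over the whole batch on one device, so w_parallel and w_single match. That equivalence is why the batch dimension is usually the first thing to be split.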