by mirker 1164 days ago
Isn’t the simpler example data parallel training? That’s a reduce and broadcast.

You’re correct. I was going to add to my answer that this is combined with DDP for “3D”-style parallelism that could specifically benefit from the TPU’s network topology, but by the time I finished fixing all the typos from writing it on my phone (and still missed a few), I completely forgot :)
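
For anyone following along, here is a rough sketch of the "reduce and broadcast" view of data parallel training mentioned above. It is a toy single-process simulation in plain Python, not how DDP is actually implemented (real frameworks typically use a fused all-reduce). The function names (`reduce_to_root`, `broadcast_from_root`, `data_parallel_step`), the plain SGD update, and the learning rate are all made up for illustration.

```python
# Toy single-process simulation of data parallel gradient sync.
# All names are hypothetical; this only illustrates reduce + broadcast.

def reduce_to_root(worker_grads):
    """Sum each gradient component across workers (the 'reduce' step)."""
    return [sum(components) for components in zip(*worker_grads)]


def broadcast_from_root(value, num_workers):
    """Give every worker its own copy of the root's value (the 'broadcast' step)."""
    return [list(value) for _ in range(num_workers)]


def data_parallel_step(params, worker_grads, lr=0.5):
    """One SGD step where every replica applies the same averaged gradient."""
    num_workers = len(worker_grads)
    summed = reduce_to_root(worker_grads)
    averaged = [g / num_workers for g in summed]
    synced = broadcast_from_root(averaged, num_workers)
    # Every replica starts from identical params and applies identical grads,
    # so the replicas remain identical after the update.
    return [[p - lr * g for p, g in zip(params, grads)] for grads in synced]


if __name__ == "__main__":
    params = [1.0, 2.0]
    worker_grads = [[0.5, 1.0], [1.5, 3.0]]  # two workers, local gradients
    print(data_parallel_step(params, worker_grads))
```

Each worker computes gradients on its own shard of the batch, the gradients are summed and averaged in one place, and the result is copied back so all replicas take the same step.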