by nsteel
437 days ago
Wouldn't a 3D torus network have horrible performance with 9,216 nodes? And really horrible latency? I'd have assumed traditional spine-leaf would do better. But I must be wrong as they're claiming their latency is great here. Of course, they provide zero actual evidence of that. And I'll echo, what even is an AI data center, because we're still none the wiser.
That said, the torus approach was a gamble that most workloads would be nearest-neighbor, and allreduce needs extra work to optimize.
An AI data center tends to have enormous power and cooling capacity, less disk, and a slightly different networking setup. But really it just means "this part of the warehouse has more ML chips than disks."
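
For a rough feel of the hop counts behind the torus latency question, here is a back-of-the-envelope sketch. The 16 x 24 x 24 shape is purely an assumption that happens to multiply out to 9,216; the thread doesn't give the real dimensions, and `torus_hop_stats` is a made-up helper name. It counts hops only, so it says nothing about per-hop latency, congestion, or how a real pod is actually wired.

```python
from itertools import product


def ring_distance(a, b, size):
    """Shortest hop count between positions a and b on a ring of `size` nodes."""
    d = abs(a - b)
    return min(d, size - d)


def torus_hop_stats(dims):
    """Return (diameter, mean_hops) for a torus with the given dimensions.

    A torus looks the same from every node, so measuring from the origin
    gives the same numbers as averaging over all source nodes.
    """
    total = 0
    diameter = 0
    count = 0
    for coord in product(*(range(n) for n in dims)):
        # Hops add up per dimension, using the wraparound link when shorter.
        hops = sum(ring_distance(0, c, n) for c, n in zip(coord, dims))
        total += hops
        diameter = max(diameter, hops)
        count += 1
    # Exclude the origin itself from the average.
    return diameter, total / (count - 1)


if __name__ == "__main__":
    dims = (16, 24, 24)  # ASSUMPTION: one possible shape with 9,216 nodes
    diameter, mean_hops = torus_hop_stats(dims)
    print(f"nodes={dims[0] * dims[1] * dims[2]} "
          f"diameter={diameter} mean_hops={mean_hops:.2f}")
```

Under that assumed shape the worst case is 32 hops and a uniformly random pair averages about 16, versus a small fixed number of switch hops in a spine-leaf fabric. That is the cost for all-to-all style traffic; for nearest-neighbor traffic every transfer is a single hop, which is the gamble described above.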