by patagurbon
649 days ago
There's a pretty big difference between the workloads that these supercomputers run and the workloads of training big LLMs (to be clear, hyperscalers also often have "supercomputers" for rent that look more like the DoE laboratory machines). AI models are trained using some combination of {data parallelism, tensor parallelism, pipeline parallelism}. These all have fairly regular access patterns, and mostly want bandwidth. Traditional supercomputer loads (typically MPI or SHMEM) are often far more variable in access pattern, and synchronization is often incredibly carefully optimized. Bandwidth is still hugely important here, but insane network switches and topologies tend to be the real secret sauce. More and more these machines are built using commodity hardware (instead of stuff like Knights Landing from Intel), but the switches and network topology are still often pretty bespoke. This is required for really fine-tuned algorithms like distributed LU factorization, or matrix multiplication algorithms like COSMOS. The hyperscalers, by contrast, often want insane quantities of commodity hardware, network switches included. The AI supercomputers you're citing are getting a lot closer, but they are definitely more disaggregated than DoE lab machines by nature of the software they run.
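To make "fairly regular access patterns" concrete, here is a rough sketch of the gradient all-reduce that data parallelism leans on, simulated in plain Python as a ring. This is only an illustration under my own assumptions: `ring_allreduce` and the `sends` log are made-up names, the workers are just lists in one process, and real training stacks use tuned collective libraries rather than anything like this. The thing to notice is that every worker talks to the same neighbor with the same chunk size at every step.

```python
def ring_allreduce(worker_grads):
    """Simulate a ring all-reduce (sum) over equal-length gradient lists.

    Hypothetical helper for illustration only. Returns the per-worker
    buffers after the reduction plus a log of every message sent as
    (step, src, dst, chunk_index).
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert length % n == 0, "sketch assumes the vector splits evenly into n chunks"
    chunk = length // n
    bufs = [list(g) for g in worker_grads]  # copy so inputs are untouched
    sends = []

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After n-1 steps, rank r holds the full
    # sum for chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing data first so all sends in a step are "simultaneous".
        outgoing = [
            (r, (r + 1) % n, (r - step) % n, bufs[r][span((r - step) % n)])
            for r in range(n)
        ]
        for src, dst, c, data in outgoing:
            bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], data)]
            sends.append((step, src, dst, c))

    # Phase 2: all-gather. Each completed chunk is passed around the ring.
    for step in range(n - 1):
        outgoing = [
            (r, (r + 1) % n, (r + 1 - step) % n, bufs[r][span((r + 1 - step) % n)])
            for r in range(n)
        ]
        for src, dst, c, data in outgoing:
            bufs[dst][span(c)] = data
            sends.append((n - 1 + step, src, dst, c))

    return bufs, sends


if __name__ == "__main__":
    grads = [[rank * 10 + i for i in range(8)] for rank in range(4)]
    reduced, log = ring_allreduce(grads)
    print(reduced[0])
    print(len(log), "messages, all to the next rank in the ring")
```

Contrast that with something like the pivoting in distributed LU, where who talks to whom depends on the data, and that is where the bespoke switches and topologies earn their keep.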