
This depends heavily on the exact model size, architecture, and hardware configuration. If the compute time for some unit of work is longer than the time it takes to transfer the next batch of parameters, you are good to go. If you do the two sequentially then yes, you will pay a heavy price, but the idea is to fetch a future layer, not the one you need right away.
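To make the overlap argument concrete, here is a toy timing model (not a real framework API, just a sketch). It assumes a single transfer channel and a one-layer lookahead, where the fetch for layer i+1 is issued the moment layer i starts computing. The function names and the per-layer time lists are made up for illustration.

```python
def sequential_time(compute, transfer):
    """Fetch layer i, then compute layer i, with no overlap at all."""
    return sum(compute) + sum(transfer)


def prefetched_time(compute, transfer):
    """One-layer lookahead: the fetch for layer i+1 is issued when layer i
    starts computing (toy assumption: a single transfer channel)."""
    if not compute:
        return 0.0
    fetch_done = transfer[0]  # the first fetch can never be hidden
    compute_end = 0.0
    for i in range(len(compute)):
        # Layer i starts once the previous layer is done AND its params arrived.
        start = max(compute_end, fetch_done)
        if i + 1 < len(compute):
            # The next layer's transfer runs concurrently with this compute.
            fetch_done = start + transfer[i + 1]
        compute_end = start + compute[i]
    return compute_end


if __name__ == "__main__":
    # Compute-bound case: each transfer is shorter than each compute step.
    print(sequential_time([4, 4, 4], [3, 3, 3]), prefetched_time([4, 4, 4], [3, 3, 3]))
    # Transfer-bound case: prefetching helps, but stalls remain.
    print(sequential_time([2, 2, 2], [5, 5, 5]), prefetched_time([2, 2, 2], [5, 5, 5]))
```

In this model, when every transfer is shorter than the compute step it overlaps with, the only exposed transfer is the very first one. When transfers are longer, compute stalls waiting on parameters and the total is dominated by transfer time instead.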

As a similar example, I have trained video models on ~1000 H100s where the vast majority of parameters are sharded and so must first be fetched over the network before they are available in HBM, which is a similar imbalance to the HBM vs SRAM story. We were able to fully mask comms time, such that not sharding (if it were even possible) would have offered no performance advantage.