Hacker News new | ask | show | jobs
by vlovich123 254 days ago
I did experiments with this on traditional consumer GPU and the larger the discrepancy between model size and VRAM, the faster it dropped off (exponentially) to as if you didn’t even have any VRAM in the first place (over PCIe). This technique is well known and works when you have more than enough bandwidth.

However, the whole point that even HBM is a problem is the available bandwidth is insufficient, so if you’re marrying SRAM and HBM I would expect the performance gains to be overall modest for models that exceed available SRAM in a meaningful way.

1 comments

This is highly dependent on exact model size, architecture and hardware configurations. If the compute time for some unit of work is larger than the time it takes to transfer the next batch of Params you are good to go. If you are doing it sequentially though then yes you will pay a heavy price, but the idea is to fetch a future layer not the one you need right away.

As a similar example I have trained video models on ~1000 H100 where the vast majority of parameters are sharded and so need to be first fetched on the network before being available on HBM, which is similar imbalance to the HBM vs SRAM story. We were able to fully mask comms time such that not sharding (if it was even possible) would offer no performance advantage.