
That may have been more of a WSE-1 problem. They switched to a new execution mode (details are on their site if you look up "weight streaming") where they basically keep the activations on the wafer instead of the whole model. Even for something very large (say, 32K context and a 16K hidden dimension), one layer's activations would be only 1-2 GB (at 16-bit or 32-bit precision). As I understand it, this was one of the key changes needed to go from single-system boxes to the supercomputing clusters they have been able to deploy.
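For anyone who wants to check the 1-2 GB figure, here is a rough back-of-the-envelope sketch. It assumes the activation for one layer is just a dense context-by-hidden tensor with one value per element; the function name and the choice of GiB are mine, not anything from Cerebras.

```python
# Rough size of one layer's activation tensor, assuming a dense
# (context_len x hidden_dim) array. Illustrative only.

def activation_bytes(context_len: int, hidden_dim: int, bytes_per_value: int) -> int:
    """Bytes needed to hold one context_len x hidden_dim activation tensor."""
    return context_len * hidden_dim * bytes_per_value

GIB = 1024 ** 3

context_len = 32 * 1024   # "32K context"
hidden_dim = 16 * 1024    # "16K hidden dimension"

fp16_gib = activation_bytes(context_len, hidden_dim, 2) / GIB  # 16-bit values
fp32_gib = activation_bytes(context_len, hidden_dim, 4) / GIB  # 32-bit values

print(f"16-bit: {fp16_gib} GiB, 32-bit: {fp32_gib} GiB")
```

That works out to 1 GiB at 16-bit and 2 GiB at 32-bit, regardless of how many layers the model has.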

Nvidia needs a higher bandwidth-to-compute ratio because its GPUs are moving data around all the time. By keeping all the outputs on the wafer and streaming only the weights, you get a much more favorable bandwidth-to-compute requirement. The number of layers also matters less, because what stays on the wafer is the transient outputs, not every layer's weights.
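To put a rough number on the bandwidth-to-compute point, here is a simplified sketch. It assumes a single dense hidden-by-hidden weight matrix that is streamed in once and applied to every token resident on the wafer, and it ignores attention, activation traffic, and everything else about the real hardware. The function name and parameters are hypothetical.

```python
# FLOPs performed per byte of weights streamed, under a simplified model:
# one dense (hidden_dim x hidden_dim) matmul applied to `tokens` resident tokens.
# Illustrative assumption, not a description of how any real system schedules work.

def flops_per_streamed_byte(tokens: int, hidden_dim: int, bytes_per_weight: int) -> float:
    flops = 2 * tokens * hidden_dim * hidden_dim             # multiply + add per weight per token
    streamed_bytes = hidden_dim * hidden_dim * bytes_per_weight  # weights are read once
    return flops / streamed_bytes

hidden_dim = 16 * 1024

# Many tokens kept on the wafer: each streamed weight is reused for every token.
many_tokens = flops_per_streamed_byte(32 * 1024, hidden_dim, 2)

# A single token: each streamed weight is used once.
one_token = flops_per_streamed_byte(1, hidden_dim, 2)

print(many_tokens, one_token)
```

The hidden dimension cancels out, so in this model the ratio depends only on how many tokens are resident and the weight precision: the more outputs you can hold in place, the less weight bandwidth you need per unit of compute.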

This is probably one of the primary reasons they didn't need to increase SRAM for WSE-3. WSE-2 was developed under the old "fit the whole model on the chip" paradigm, but models have since eclipsed 1 TB, so the new approach is more scalable.