That blog is about training. For inference, the weights and kv cache are in SRAM. Having said that, the $100M number is inaccurate/meaningless. It's a niche product that doesn't have economies of scale yet.
The blog is about training but the technique applies equally well to inference, just like FSDP and kv cache sharing are routinely done in inference on GPUs.
There is just no need to have parameters or kv cache for layer 48 in SRAM when you are currently computing layer 3, you have all the time in the world to move that to SRAM when you get to layer 45 or whatever the maths work out to be for your specific model.
There is just no need to have parameters or kv cache for layer 48 in SRAM when you are currently computing layer 3, you have all the time in the world to move that to SRAM when you get to layer 45 or whatever the maths work out to be for your specific model.