| HN Mirror

The blog is about training but the technique applies equally well to inference, just like FSDP and kv cache sharing are routinely done in inference on GPUs.

There is just no need to have parameters or kv cache for layer 48 in SRAM when you are currently computing layer 3, you have all the time in the world to move that to SRAM when you get to layer 45 or whatever the maths work out to be for your specific model.