by sailingparrot
263 days ago
> I estimated that they needed over $100m of chips just to do Qwen 3 at max context size

I will point out (again :)) that this math is completely wrong. There is no need to store the model's entire weights in SRAM, and doing so brings no performance gain. You simply keep n transformer blocks on-chip and, when you start computing block l, stream block l+n from external memory onto the chip. This completely masks the communication time behind the compute time, and specifically does not require you to buy $100M worth of SRAM.

This is standard stuff that is done routinely in many scenarios, e.g. FSDP.

https://www.cerebras.ai/blog/cerebras-software-release-2.0-5...
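To make the overlap argument concrete, here is a toy timing model. It is only a sketch, not what Cerebras or FSDP actually runs. The function name `simulate_streaming`, the single serialized link to external memory, the uniform per-block compute and transfer times, and the assumption that the first n blocks are already on-chip at time 0 are all mine, for illustration only.

```python
def simulate_streaming(num_blocks, n_resident, compute_t, transfer_t):
    """Toy model: when compute of block l starts, begin streaming block l+n.

    Assumptions (hypothetical, for illustration):
      - blocks 0..n_resident-1 are already on-chip at time 0
      - one external-memory link, so transfers are serialized
      - every block takes compute_t to compute and transfer_t to stream in
    Returns (total_time, total_stall_time).
    """
    loaded_at = [0.0] * num_blocks  # time at which each block is on-chip
    link_free = 0.0                 # time at which the link is next idle
    now = 0.0                       # time at which the compute unit is free
    stall = 0.0                     # time compute spent waiting for weights

    for l in range(num_blocks):
        # Block l can start once the previous block is done AND l is loaded.
        start = max(now, loaded_at[l])
        stall += start - now

        # Kick off the prefetch of block l+n as compute of block l starts.
        nxt = l + n_resident
        if nxt < num_blocks:
            begin = max(start, link_free)
            link_free = begin + transfer_t
            loaded_at[nxt] = link_free

        now = start + compute_t

    return now, stall


if __name__ == "__main__":
    # Transfer is faster than compute: streaming is fully hidden.
    print(simulate_streaming(8, 1, compute_t=1.0, transfer_t=0.8))
    # Transfer is slower than compute: the link becomes the bottleneck.
    print(simulate_streaming(8, 1, compute_t=1.0, transfer_t=2.0))
```

In this model the stall time is zero whenever streaming one block takes no longer than computing one block, no matter how many blocks the model has. That is the point being made: what matters is external bandwidth relative to per-block compute time, not fitting every weight in SRAM. If the transfer is slower than the compute, the stalls come back and a larger n only delays them.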