by mikewarot
314 days ago
RAM is the reason LLMs are so power inefficient. Shuttling weights and results from RAM to compute and back for every operation is where most of the power goes. It doesn't have to be that way. For a sufficiently large load, it makes sense to use reconfigurable hardware and bake in the constants and dataflow at runtime. Think of it as an array of FPGAs large enough to hold the whole model unwound, yet configurable in seconds at runtime. You'd get tokens at 100 MHz or more. You would think saving 95% or more on power and infrastructure for a given token rate would be worth it, especially when contemplating trillion-dollar outlays.