by rfoo
848 days ago
idk. For longer context, if you just brute-force it, the problem is more on the memory side than the compute side, in both bandwidth and capacity. We actually have more than enough compute for N^2. The initial processing (prefill) is dense, but is still largely bound by memory bandwidth. Output (decode) is entirely bound by memory bandwidth, since you can't make your cores go brrr with only GEMV. And then you need the capacity to keep the KV "cache" [0] around for the session.

A single TPU v5e pod has only 4TB of HBM. Assuming pipeline parallelism across multiple TPU pods isn't going to fly, I haven't run the numbers, but I suspect you get batch=1 or batch=2 inference at best, which is prohibitively expensive.

But again, who knows. Groq demonstrated an inference tech that is more expensive per token and still wowed people with pure speed. Maybe Google's similar move is long context. They have an additional advantage in that they have exclusive access to TPUs, so until H200 ships they may be the only ones who can serve a 1M-token LLM to the public without breaking the bank.

[0] "Cache" is a really poor name. If you don't do this you get O(n^3), which is not going to work at all. IMO it's wrong to name your intermediate state a "cache" if removing it changes the asymptotic complexity.
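A rough sketch of the arithmetic behind the capacity point and the O(n^3) footnote. The model shapes here (80 layers, head_dim 128, 8 or 64 KV heads, bf16) are made-up assumptions for illustration, not the config of any real model, and the function names are hypothetical. It also ignores weights and activations, so real headroom would be smaller.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold keys and values for one sequence.

    The leading 2 accounts for storing both K and V per token, per layer,
    per KV head. bytes_per_elem=2 assumes bf16/fp16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def attention_pairs(n, use_cache):
    """Count query-key dot products needed to generate n tokens one by one."""
    if use_cache:
        # Step t: only the new query attends to the t keys seen so far.
        return sum(t for t in range(1, n + 1))            # ~ n^2 / 2
    # Step t: recompute causal attention for all t positions from scratch.
    return sum(t * (t + 1) // 2 for t in range(1, n + 1))  # ~ n^3 / 6


if __name__ == "__main__":
    TB = 10**12
    pod_hbm = 4 * TB  # the 4TB v5e pod figure from the comment

    # Hypothetical shapes: grouped-query (8 KV heads) vs. full multi-head (64).
    for heads in (8, 64):
        size = kv_cache_bytes(1_000_000, n_layers=80, n_kv_heads=heads, head_dim=128)
        print(f"kv_heads={heads}: {size / TB:.2f} TB per 1M-token sequence, "
              f"at most {pod_hbm // size} sequences in 4 TB")

    for n in (1_000, 10_000):
        print(n, attention_pairs(n, True), attention_pairs(n, False))
```

Under these assumed shapes, the 64-KV-head case needs about 2.6 TB for a single 1M-token sequence, which is roughly where a batch=1 guess comes from; the 8-KV-head case is about 0.33 TB. The second function shows why dropping the "cache" moves you from quadratic to cubic total work.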