by QuadmasterXLII
848 days ago
A 10 million token context means a forward pass is 100 trillion vector-vector products (10 million squared). A single A6000 can do 38 trillion float-float products a second. I think their vectors are ~4000 elements long? So the question is, would the Google you know devote 12,000 GPUs for one second to help a blogger find a line about Jewish softball, in the hopes that it would boost PR? My guess is yes tbh
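A quick sketch of that arithmetic, using only the rough figures from the comment above (10 million tokens, ~4000-element vectors, 38 trillion products per second per A6000). The function name and parameters are made up for illustration, and like the comment it ignores layers, heads, and everything that isn't the N^2 term:

```python
# Back-of-envelope check of the numbers in the comment above.
# All inputs are the comment's own rough guesses, not measured values.

def gpu_seconds(context_tokens: int, vector_len: int, products_per_sec: float) -> float:
    """GPU-seconds for one brute-force N^2 pass over the whole context."""
    pair_count = context_tokens ** 2            # every token against every token
    scalar_products = pair_count * vector_len   # each pair is one dot product
    return scalar_products / products_per_sec

pairs = 10_000_000 ** 2
estimate = gpu_seconds(10_000_000, 4_000, 38e12)

print(f"{pairs:.0e} vector-vector products")
print(f"~{estimate:,.0f} GPU-seconds on A6000-class hardware")
```

Under those assumptions it comes out a bit over 10,000 GPU-seconds, so "12,000 GPUs for one second" is the right ballpark.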
|
For brute-forcing longer context, the problem is more on the memory side than the compute side: both bandwidth and capacity. We actually have more than enough compute for N^2. The initial processing is dense, but is still largely bound by memory bandwidth. Output is entirely bound by memory bandwidth, since you can't make your cores go brrr with only GEMV. And then you need capacity to keep the KV "cache" [0] for the session. A single TPU v5e pod has only 4TB of HBM. Assuming pipeline parallelism across multiple TPU pods isn't going to fly, I haven't run the numbers, but I suspect you get batch=1 or batch=2 inference at best, which is prohibitively expensive. But again, who knows: Groq demonstrated an inference tech that is more expensive per token and still wowed people with pure speed. Maybe Google's similar move is long context. They have an additional advantage in that they have exclusive access to TPUs, so until the H200 ships they may be the only ones who can serve a 1M-token LLM to the public without breaking the bank.
[0] "Cache" is a really poor name. If you don't do this you get O(n^3), which is not going to work at all. IMO it's wrong to name your intermediate state "cache" if removing it changes the asymptotic complexity.
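To make the footnote's complexity claim concrete, here is a toy count of query-key pairs when generating n tokens one at a time, with and without keeping the keys and values around. It is a counting sketch only, not a real attention implementation, and the function names are hypothetical:

```python
def pairs_without_cache(n: int) -> int:
    """Query-key pairs if every step recomputes causal attention for the whole prefix."""
    # Step t re-runs attention over a prefix of length t: t*(t+1)/2 pairs.
    return sum(t * (t + 1) // 2 for t in range(1, n + 1))

def pairs_with_cache(n: int) -> int:
    """Query-key pairs if earlier keys/values are kept and only the new token attends."""
    # Step t: one new query against t stored keys.
    return sum(range(1, n + 1))

for n in (100, 200, 400):
    print(n, pairs_without_cache(n), pairs_with_cache(n))
```

Doubling n multiplies the uncached count by roughly 8 and the cached count by roughly 4, which is the O(n^3) versus O(n^2) gap the footnote is complaining about.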