|
|
|
|
|
by yorwba
236 days ago
|
|
Not really, Figure 1(a) of the paper says that the 17.7% are relative to a total of 30k GPUs (i.e. 5310 GPUs for handling those 1.35% of requests) and the reduction is measured in a smaller beta deployment with only 47 different models (vs. the 733 "cold" models overall.) Naïve extrapolation by model count suggests they would need 3321 GPUs to serve all cold models, a 37.5% reduction to before. (Or 6.6% reduction of the full 30k-GPU cluster.) |
|
"A paper presented at SOSP 2025 details how token-level scheduling helped one GPU serve multiple LLMs, reducing demand from 1,192 to 213 H20s."
Which, if you scale it, matches the GPs statement.