by jdcasale
146 days ago
The math is obvious on this one. It's well documented that model performance on complex tasks scales (up to some asymptote) with the amount of inference-time compute allocated. Because LLM providers have limited compute, they must dynamically scale inference-time compute based on current load. Thus it's impossible for traffic spikes _not_ to cause some degradation in model performance (at least until/unless they acquire enough compute to saturate that asymptotic curve for every request under all demand conditions; it does not seem plausible that they are anywhere close to this).
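To make the argument concrete, here's a toy sketch. Everything in it is an assumption for illustration: the saturating-exponential shape of the curve, the parameters `q_max` and `k`, and the even split of a fixed budget across requests. It isn't how any real provider allocates compute; it just shows that with a fixed budget and a curve that isn't saturated, more concurrent requests means lower per-request quality.

```python
import math

def quality(compute_per_request, q_max=1.0, k=1.0):
    """Hypothetical saturating curve: quality rises with compute toward q_max."""
    return q_max * (1.0 - math.exp(-compute_per_request / k))

def quality_under_load(total_compute, n_requests):
    """Assumed policy: split a fixed compute budget evenly across requests."""
    return quality(total_compute / n_requests)

if __name__ == "__main__":
    # Same total budget, increasing concurrent load.
    for n in (10, 100, 1000):
        print(n, round(quality_under_load(100.0, n), 3))
```

Under these assumptions, quality only stays flat during a spike if per-request compute is already far out on the flat part of the curve, which is the "saturate the asymptote" condition above.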
Either they overprovision servers during low demand, or they dynamically provision servers based on load.