Hacker News new | ask | show | jobs
by samhoss93 20 days ago
Agree. At high concurrency, you are better off spending the compute budget on parallel requests rather than draft prediction. The challenging part is that most deployment don't have static traffic profiles. A configuration that was right at launch may no longer be correct months later, and there is no signal that tells you when you have crossed the threshold.