by mgfist
301 days ago
Running a local model is not an apples-to-apples comparison. Yes, if you run a small model 24/7 with no concern for output latency, and utilization is completely static with no bursts, then it can look cheap. But most people want output now, not in 10 hours. And they want it from the best models. And they want large context windows. And when you combine that with serving millions of users, it gets complicated and expensive.

> But most people want output now, not in 10 hours.
At 65 t/s, 10 hours of generation is roughly 2.3 million output tokens.
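
For anyone who wants to check the arithmetic, here is a quick sketch. It assumes a steady 65 tokens per second for the full 10 hours, with no idle time and no time spent on prompt processing; the function name is just illustrative.

```python
def tokens_generated(tokens_per_second: float, hours: float) -> int:
    """Total output tokens at a constant generation rate (an idealized assumption)."""
    seconds = hours * 3600  # 3600 seconds per hour
    return round(tokens_per_second * seconds)


total = tokens_generated(65, 10)
print(f"{total:,} tokens")  # 2,340,000 tokens
```

Swapping in a different rate or duration shows how sensitive the total is to sustained throughput.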