|
|
|
|
|
by storystarling
148 days ago
|
|
Do you have specific t/s numbers for those dense models? I'm curious just how severe the memory bandwidth bottleneck gets in practice. I'm not sure I agree on the cost aspect though. For high-volume production workloads the API bills scale linearly and can get painful fast. If you can amortize the hardware over a year and keep the data local for privacy, the math often works out in favor of self-hosting. |
|
On the same hardware gpt-oss-120b at 128k context, I fed it a longer version of the input (a whole novel, 97k tok), and it input processed at 1650 tok/s and output generated at 27 tok/s. Just fast enough IMO