by ivape
335 days ago
> typically you set a cap for how many requests a single instance is meant to serve. Past that cap they'll spin up more instances.

Thank you, this is what I wanted to know. If this is on us then we'd have to make sure whatever caps we set beat API providers. I don't know how easy that cap is to figure out.

To benchmark, you identify the maximum time to first token your users will accept and the minimum tokens per second they'll accept, then test how many concurrent requests you can handle before you exceed either limit.
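
For what it's worth, that procedure can be sketched in a few lines of Python. This is only a rough illustration, not anyone's actual harness: `Measurement`, `max_concurrency`, and `fake_measure` are hypothetical names, and `fake_measure` returns made-up numbers in place of a real load test against your inference server.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Measurement:
    ttft_s: float        # observed time to first token, in seconds
    tokens_per_s: float  # observed per-request generation speed


def max_concurrency(
    measure: Callable[[int], Measurement],
    levels: Iterable[int],
    max_ttft_s: float,
    min_tokens_per_s: float,
) -> Optional[int]:
    """Return the highest tested concurrency that stays within both limits.

    Walks the levels in the order given and stops at the first one that
    exceeds either limit. Returns None if even the first level fails.
    """
    best = None
    for n in levels:
        m = measure(n)
        if m.ttft_s > max_ttft_s or m.tokens_per_s < min_tokens_per_s:
            break
        best = n
    return best


def fake_measure(n: int) -> Measurement:
    # Stand-in for a real load test: invented numbers where latency grows
    # and per-request speed drops as concurrency n increases.
    return Measurement(ttft_s=0.2 + 0.05 * n, tokens_per_s=60.0 / (1 + 0.1 * n))


if __name__ == "__main__":
    cap = max_concurrency(fake_measure, [1, 2, 4, 8, 16, 32],
                          max_ttft_s=1.5, min_tokens_per_s=25.0)
    print("per-instance cap:", cap)
```

In a real run you'd swap `fake_measure` for something that fires `n` simultaneous requests at the instance and records the worst (or a high percentile) time to first token and tokens per second. The number that comes back is a reasonable starting point for the per-instance cap.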
I can tell you, in my case the only reason why the pricing is somewhat competitive for self-hosting is that I'm aggressively seeking cheap rentals, have a use-case that requires very long prompts with few cache hits, and I've used extensive (and expensive) post-training to deploy smaller models than I'd otherwise need.