by ivape 335 days ago
> Past that cap they'll spin up more instances.

Thank you, this is what I wanted to know.

> typically you set a cap for how many requests a single instance is meant to serve

If this is on us, then we'd have to make sure whatever caps we set let us beat the API providers on price. I don't know how easy that cap is to figure out.


If you're making the effort-cost tradeoff like this, you typically choose a model, test a few inference stacks with prompts of representative length for your use case, then benchmark.

To benchmark, you pick the maximum time to first token your users will accept and the minimum tokens per second they'll accept, then test how many concurrent requests you can handle before you break either limit.
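A rough sketch of that search loop, in case it helps. Everything named here is made up for illustration: `Measurement`, `max_concurrency`, and `fake_measure` are not from any inference stack. In practice `measure` would be your own load test, firing `n` simultaneous requests with representative prompts and reporting the worst time to first token and the slowest per-request decode speed it saw.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Measurement:
    ttft_s: float        # worst (or p95) time to first token, in seconds
    tokens_per_s: float  # slowest (or p5) per-request decode speed


def max_concurrency(
    measure: Callable[[int], Measurement],
    max_ttft_s: float,
    min_tokens_per_s: float,
    levels: Iterable[int],
) -> int:
    """Return the highest tested concurrency that stays within both limits.

    Levels are tried in the order given (assumed ascending). The search
    stops at the first level that breaks either limit. Returns 0 if even
    the first level fails.
    """
    best = 0
    for n in levels:
        m = measure(n)
        if m.ttft_s > max_ttft_s or m.tokens_per_s < min_tokens_per_s:
            break
        best = n
    return best


def fake_measure(n: int) -> Measurement:
    """Stand-in for a real load test; the scaling here is invented."""
    return Measurement(ttft_s=0.1 * n, tokens_per_s=100.0 / n)


if __name__ == "__main__":
    cap = max_concurrency(
        fake_measure,
        max_ttft_s=2.0,
        min_tokens_per_s=10.0,
        levels=[1, 2, 4, 8, 16, 32],
    )
    print(f"per-instance cap: {cap} concurrent requests")
```

The number that comes out is the per-instance cap discussed upthread. Doubling the levels keeps the number of load tests small; you can rerun with a finer range around the point where it fails.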

I can tell you, in my case the only reasons the pricing is somewhat competitive for self-hosting are that I'm aggressively seeking cheap rentals, have a use case that requires very long prompts with few cache hits, and I've used extensive (and expensive) post-training to deploy smaller models than I'd otherwise need.
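For the comparison against API pricing, a back-of-the-envelope sketch. The function name and every number in it are placeholders, not my actual figures. It only counts generated tokens and assumes each request decodes at a steady rate; with very long prompts you'd need to account for prompt processing as well, since API providers bill input tokens too.

```python
def cost_per_million_output_tokens(
    hourly_rental_usd: float,
    concurrent_requests: int,
    tokens_per_s_per_request: float,
    utilization: float = 1.0,
) -> float:
    """Estimate self-hosted cost per one million generated tokens.

    utilization is the fraction of each rented hour the instance actually
    spends serving at the given concurrency (1.0 means always full).
    """
    tokens_per_hour = (
        concurrent_requests * tokens_per_s_per_request * 3600 * utilization
    )
    if tokens_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return hourly_rental_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: a $2/hour rental serving 8 concurrent requests
    # at 12.5 tokens/s each, busy half the time.
    estimate = cost_per_million_output_tokens(2.0, 8, 12.5, utilization=0.5)
    print(f"${estimate:.2f} per million output tokens")
```

Utilization is the term that tends to hurt: a rented GPU costs the same whether it is idle or full, so the estimate doubles every time your real traffic halves.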