by BoorishBears
335 days ago
If you're making the effort-cost tradeoff like this, you typically choose a model, test a few inference stacks with prompts of representative length for your use case, then benchmark. To benchmark, you identify the maximum time to first token your users will accept and the minimum tokens per second they'll accept, then test how many concurrent requests you can handle before you exceed either limit. I can tell you that in my case, the only reason the pricing for self-hosting is somewhat competitive is that I'm aggressively seeking cheap rentals, have a use case that requires very long prompts with few cache hits, and have used extensive (and expensive) post-training to deploy smaller models than I'd otherwise need.
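To make the benchmark step concrete, here's a rough sketch of the "how many concurrent requests before I exceed either limit" check. It assumes you've already run your load tests and recorded results per concurrency level. The `LoadResult` type, the choice of p95 for time to first token, the use of the slowest per-request decode rate, and all the numbers are my own illustrative assumptions, not output from any particular inference stack.

```python
from dataclasses import dataclass


@dataclass
class LoadResult:
    """One load-test run at a fixed concurrency (hypothetical record type)."""
    concurrency: int
    p95_ttft_s: float        # 95th-percentile time to first token, in seconds
    min_tokens_per_s: float  # slowest per-request decode rate seen in the run


def max_acceptable_concurrency(results, max_ttft_s, min_tps):
    """Return the highest concurrency reached before either limit is exceeded.

    Walks the runs in increasing concurrency and stops at the first one that
    breaks the time-to-first-token or tokens-per-second limit. Returns 0 if
    even the lowest tested level fails.
    """
    best = 0
    for r in sorted(results, key=lambda r: r.concurrency):
        if r.p95_ttft_s > max_ttft_s or r.min_tokens_per_s < min_tps:
            break
        best = r.concurrency
    return best


# Made-up measurements, purely for illustration.
results = [
    LoadResult(concurrency=1, p95_ttft_s=0.4, min_tokens_per_s=60.0),
    LoadResult(concurrency=8, p95_ttft_s=0.9, min_tokens_per_s=42.0),
    LoadResult(concurrency=16, p95_ttft_s=1.8, min_tokens_per_s=25.0),
    LoadResult(concurrency=32, p95_ttft_s=4.5, min_tokens_per_s=12.0),
]

print(max_acceptable_concurrency(results, max_ttft_s=2.0, min_tps=20.0))
```

Divide your hourly rental cost by the throughput at that concurrency level and you get a cost per token you can compare against API pricing. Which percentile you hold yourself to is a judgment call about what your users will tolerate.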