If you mean the serverless GPU offering, typically you set a cap for how many requests a single instance is meant to serve. Past that cap they'll spin up more instances.
But if you mean rentals, scaling is on you. With LLM inference there's a regime where the model responses will slow down on a per-user basis while overall throughput goes up, but eventually you'll run out of headroom and need more servers.
Another reason why generally speaking it's hard to compete with major providers on cost effectiveness.
If you're making the effort-cost tradeoff like this, you typically choose a model, test a few inference stacks with prompts that are representative lengths for your use case, then benchmark.
To benchmark you identify a maximum time to first token your users will accept, and minimum tokens per second they'll accept, then test how many concurrent requests you can handle before you exceed either limit.
I can tell you, in my case the only reason why the pricing is somewhat competitive for self-hosting is that I'm aggressively seeking cheap rentals, have a use-case that requires very long prompts with few cache hits, and I've used extensive (and expensive) post-training to deploy smaller models than I'd otherwise need.
If you mean the serverless GPU offering, typically you set a cap for how many requests a single instance is meant to serve. Past that cap they'll spin up more instances.
But if you mean rentals, scaling is on you. With LLM inference there's a regime where the model responses will slow down on a per-user basis while overall throughput goes up, but eventually you'll run out of headroom and need more servers.
Another reason why generally speaking it's hard to compete with major providers on cost effectiveness.