by BoorishBears
335 days ago
How is what supposed to scale? If you mean the serverless GPU offering, typically you set a cap for how many requests a single instance is meant to serve. Past that cap, they'll spin up more instances. But if you mean rentals, scaling is on you. With LLM inference there's a regime where responses slow down on a per-user basis while overall throughput goes up, but eventually you run out of headroom and need more servers. That's another reason why, generally speaking, it's hard to compete with major providers on cost effectiveness.
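
As a rough sketch of the cap-based scaling described above: assuming the cap counts in-flight requests per instance (an assumption; providers may define it differently), the instance count is just a ceiling division. The function name and parameters here are hypothetical, not any provider's API.

```python
import math


def instances_needed(in_flight_requests: int, cap_per_instance: int,
                     min_instances: int = 0) -> int:
    """Instances required so that no instance exceeds its request cap.

    Hypothetical helper: assumes the cap is a limit on in-flight
    requests per instance and that load is spread evenly.
    """
    if cap_per_instance <= 0:
        raise ValueError("cap_per_instance must be positive")
    if in_flight_requests < 0:
        raise ValueError("in_flight_requests must be non-negative")
    # Round up: one request over the cap means one more instance.
    return max(min_instances, math.ceil(in_flight_requests / cap_per_instance))
```

With a cap of 8, the 9th simultaneous request is what triggers a second instance. On rentals, nothing does this for you, so you would be running this logic (plus provisioning) yourself.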
Thank you, this is what I wanted to know.
> typically you set a cap for how many requests a single instance is meant to serve
If this is on us, then we'd have to make sure whatever caps we set beat API providers on cost. I don't know how easy it is to figure out that cap.
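
One way to reason about picking that cap, as a sketch only: benchmark your own server at a few concurrency levels, then keep the highest level that is both cheaper per token than the API and still fast enough per user. Everything below is hypothetical, including the function names, the idea that you have `(concurrency, aggregate tokens/sec)` measurements, and every number in the usage example.

```python
from typing import Iterable, Optional, Tuple


def cost_per_million_tokens(hourly_price_usd: float,
                            aggregate_tokens_per_sec: float) -> float:
    """Rental cost per 1M generated tokens at a given sustained throughput."""
    tokens_per_hour = aggregate_tokens_per_sec * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


def pick_cap(measurements: Iterable[Tuple[int, float]],
             hourly_price_usd: float,
             api_price_per_million: float,
             min_per_user_tps: float) -> Optional[int]:
    """Largest measured concurrency that beats the API price while
    keeping per-user speed acceptable; None if no level qualifies.

    measurements: (concurrency, aggregate tokens/sec) pairs from your
    own benchmarks (assumed inputs, not real data).
    """
    best = None
    for concurrency, aggregate_tps in measurements:
        per_user_tps = aggregate_tps / concurrency
        cost = cost_per_million_tokens(hourly_price_usd, aggregate_tps)
        if per_user_tps >= min_per_user_tps and cost <= api_price_per_million:
            best = concurrency if best is None else max(best, concurrency)
    return best


# Made-up benchmark: throughput rises with concurrency, per-user speed falls.
made_up = [(1, 50.0), (8, 320.0), (32, 800.0), (64, 960.0)]
print(pick_cap(made_up, hourly_price_usd=2.0,
               api_price_per_million=1.0, min_per_user_tps=20.0))
```

Note this assumes the server stays loaded at that concurrency around the clock; idle hours raise the effective cost per token, which is part of why beating API providers is hard.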