|
|
|
|
|
by jcgrillo
490 days ago
|
|
> Cost as in, cost to you? Or cost to serve? This. IIUC to serve an LLM is to perform an O(n^2) computation on the model weights for every single character of user input. These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user and perform hundreds of TB worth of computations per query. How much would I have to charge for this? Are there any products where the users would actually get enough value out of it to pay what it costs? Compare to the cost of a user session in a normal database backed web app. Even if that session fans out thousands of backend RPCs across a hundred services, each of those calls executes in milliseconds and requires only a fraction of the LLM's RAM. So I can support thousands of concurrent users per node instead of one. |
|
The computations are not O(n^2) in terms of model weights (parameters), but linear. If it were quadratic, the number would be ludicrously large. Like, "it'll take thousands of years to process a single token" large.
(The classic transformers are quadratic on the context length, but that's a much smaller number. And it seems pretty obvious from the increases in context lengths that this is no longer the case in frontier models.)
> These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user
The parameters are static, not mutated during the query. That memory can be shared between the concurrent users. The non-shared per-query memory usage is vastly smaller.
> How much would I have to charge for this?
Empirically, as little as 0.00001 cents per token.
For context, the Bing search API costs 2.5 cents per query.