| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jsnell 491 days ago

> IIUC to serve an LLM is to perform an O(n^2) computation on the model weights for every single character of user input.

The computations are not O(n^2) in terms of model weights (parameters), but linear. If it were quadratic, the number would be ludicrously large. Like, "it'll take thousands of years to process a single token" large.

(The classic transformers are quadratic on the context length, but that's a much smaller number. And it seems pretty obvious from the increases in context lengths that this is no longer the case in frontier models.)

> These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user

The parameters are static, not mutated during the query. That memory can be shared between the concurrent users. The non-shared per-query memory usage is vastly smaller.

> How much would I have to charge for this?

Empirically, as little as 0.00001 cents per token.

For context, the Bing search API costs 2.5 cents per query.

1 comments

jcgrillo 491 days ago

Ah got it, that's more sensible. So is anyone making money with these things yet?

link