| > IIUC to serve an LLM is to perform an O(n^2) computation on the model weights for every single character of user input. The computations are not O(n^2) in terms of model weights (parameters), but linear. If it were quadratic, the number would be ludicrously large. Like, "it'll take thousands of years to process a single token" large. (The classic transformers are quadratic on the context length, but that's a much smaller number. And it seems pretty obvious from the increases in context lengths that this is no longer the case in frontier models.) > These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user The parameters are static, not mutated during the query. That memory can be shared between the concurrent users. The non-shared per-query memory usage is vastly smaller. > How much would I have to charge for this? Empirically, as little as 0.00001 cents per token. For context, the Bing search API costs 2.5 cents per query. |