| HN Mirror

If you make careful calculations and estimate the theoretical margins for inference only of most of the big open models on openrouter, the margins are typically crazy high if the openrouter providers served at scale (north of 800% for most of the large models). The high cost probably reflects salaries, investments, and amortization of other expenses like free serving or occasional partial serving occupancy. Sometimes it is hard to keep uniform high load due to other preferences of users that dont get covered at any price, eg maximal context length (which is costing output performance), latency, and time for first token, but also things like privacy guarantees, or simply switching to the next best model quickly. I have always thought that centralized inference is the real goldmine of AI because you get so much value at scale for hardly any cost.