by pama
17 days ago
Frankly, everyone in the industry knows this. When people make these statements without further clarification, they are always talking about API prices. You can look at the NVL72 specs and estimate electricity and ownership costs rather easily. Inference at data-center scale is dirt cheap, even with public code using Dynamo and SGLang. The mystery is why the early misconceptions about inefficient inference have persisted even after NVIDIA was very open about everything they did to cut costs dramatically over the last two years.
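For anyone who wants to try that kind of estimate, here is a rough sketch of the arithmetic. The function name, parameters, and every number in the example call are placeholders I made up for illustration, not actual NVL72 specs, prices, or measured throughput; plug in your own figures.

```python
HOURS_PER_YEAR = 24 * 365


def cost_per_million_tokens(
    capex_usd: float,               # purchase price of the system (assumed input)
    amortization_years: float,      # period over which the hardware is written off
    power_kw: float,                # average IT power draw of the system
    electricity_usd_per_kwh: float,
    tokens_per_second: float,       # aggregate throughput across all users
    utilization: float = 1.0,       # fraction of time the system serves traffic
    pue: float = 1.0,               # data-center overhead multiplier on power
) -> float:
    """Back-of-envelope serving cost in USD per one million tokens."""
    # Ownership cost spread evenly over the amortization period.
    ownership_per_hour = capex_usd / (amortization_years * HOURS_PER_YEAR)
    # Electricity cost, including cooling and other overhead via PUE.
    electricity_per_hour = power_kw * pue * electricity_usd_per_kwh
    # Tokens actually produced in an hour at the given utilization.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (ownership_per_hour + electricity_per_hour) / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Placeholder values only; replace with real specs and prices.
    estimate = cost_per_million_tokens(
        capex_usd=1_000_000,
        amortization_years=4,
        power_kw=100,
        electricity_usd_per_kwh=0.10,
        tokens_per_second=50_000,
        utilization=0.5,
        pue=1.3,
    )
    print(f"~${estimate:.3f} per million tokens")
```

This ignores networking, staffing, facility costs, and margin, so treat it as a floor on cost rather than a price.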
Also, as the costs of running this stuff come down, the incentive to rent models goes down with them. Running models locally means you get to keep your data local, you can tune them to do what you like, and you're not subject to model or price changes down the road. This makes self-hosting appealing to both individuals and companies. Currently, the barrier is the significant resources needed to run the models, but companies are already increasingly doing that with open models. And local inference that regular people can run is becoming a possibility as well.
While I'm sure there's always going to be a market for renting out models as a service, it may shrink significantly as the costs continue to come down.