by ankit219 29 days ago
At a gross margin level, mostly no. If you include the cost of training a model as full R&D, then possibly yes.

Batch size is what you should look at. If a cluster is running and processing one request, filling the rest of the batch has almost no marginal cost (KV cache creation/storage/fetch costs aside). But if concurrent requests exceed the batch size, one extra request costs basically the rent of an entire new cluster. API traffic is bursty, so companies would plan to price it such that they are profitable or break even at 40%-50% utilization (% of filled batch, for simplicity). So any extra request would not have the same cost as long as it runs alongside API requests. You might think that degrades the performance. Easy: just assign a higher priority tier to API requests, and a lower tier to subscription requests.
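To make the batch-size argument concrete, here is a rough sketch of the arithmetic. It assumes a simplified model where each cluster serves exactly one batch of up to `batch_size` concurrent requests and rent is a flat hourly figure; the function names and all the numbers are made up for illustration, not anyone's real pricing.

```python
import math


def clusters_needed(concurrent_requests: int, batch_size: int) -> int:
    """Clusters required if each one serves up to batch_size concurrent requests."""
    if concurrent_requests <= 0:
        return 0
    return math.ceil(concurrent_requests / batch_size)


def marginal_cost(concurrent_requests: int, batch_size: int, rent_per_hour: float) -> float:
    """Extra hourly rent caused by adding one more concurrent request.

    Ignores KV cache costs, as the comment above does.
    """
    before = clusters_needed(concurrent_requests, batch_size)
    after = clusters_needed(concurrent_requests + 1, batch_size)
    return (after - before) * rent_per_hour


def break_even_price(rent_per_hour: float, batch_size: int, utilization: float) -> float:
    """Hourly price per occupied batch slot that covers rent at a given utilization."""
    return rent_per_hour / (batch_size * utilization)


# Hypothetical numbers: batch of 64, $100/hour cluster rent.
print(marginal_cost(1, 64, 100.0))        # batch has room: no extra rent
print(marginal_cost(64, 64, 100.0))       # batch is full: a whole new cluster
print(break_even_price(100.0, 64, 0.5))   # priced to break even at 50% fill
```

Under this model, a request that lands in a half-empty batch adds nothing to rent, while the one that overflows a full batch adds a full cluster. Pricing for break-even at 50% fill means the remaining slots are already paid for.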

It's even more effective and powerful now that you have continuous batching. So likely, if the API is being used, they are not eating any loss, let alone a "big loss".
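A toy sketch of how priority tiers and continuous batching could fit together, assuming a single batch with a fixed number of slots. `BatchScheduler`, the tier constants, and the request IDs are all hypothetical; real inference servers schedule at the token level and are far more involved.

```python
import heapq
from itertools import count

# Hypothetical tiers: a lower number is admitted first.
API, SUBSCRIPTION = 0, 1


class BatchScheduler:
    """Fills free batch slots from a priority queue, API requests first."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.running = set()      # request ids currently occupying a slot
        self._queue = []          # heap of (tier, arrival order, request id)
        self._order = count()     # tie-breaker: first come, first served within a tier

    def submit(self, request_id: str, tier: int) -> None:
        heapq.heappush(self._queue, (tier, next(self._order), request_id))

    def fill(self) -> list:
        """Admit queued requests into free slots, highest priority first."""
        admitted = []
        while self._queue and len(self.running) < self.batch_size:
            _, _, request_id = heapq.heappop(self._queue)
            self.running.add(request_id)
            admitted.append(request_id)
        return admitted

    def finish(self, request_id: str) -> list:
        """Continuous batching: a finished request frees its slot right away."""
        self.running.discard(request_id)
        return self.fill()


scheduler = BatchScheduler(batch_size=2)
scheduler.submit("sub-1", SUBSCRIPTION)
scheduler.submit("api-1", API)
scheduler.submit("api-2", API)
first_batch = scheduler.fill()            # API requests take the slots first
refilled = scheduler.finish("api-1")      # freed slot goes to the subscription request
print(first_batch, refilled)
```

The subscription request only takes a slot that API traffic is not using, and with continuous batching that slot opens up as soon as any request finishes rather than when the whole batch drains.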