| > A 4090 amortized over 4 years, working days & hours, is 20 cents per working hour; But that's not how it works: you need to have enough of it to accommodate peek usage, but a good fraction of that isn't going to be running most of the time. You'd end up with a cost that's not too far from what Cloud providers are offering, which is roughly 3 times that price. And you need to pay for the whole server hosting these GPUs (this is less of a factor when you're using big GPUs like H100, but if you want to stick with consumer-grade GPUs, then the host is still a non-trivial fraction of the cost, and you're supporting a server for a small bunch of concurrent users, which means your infra team is going to work with a massive pool of servers very quickly, with all the associated costs). > It's less than a penny per hour per concurrent on a task like this. It's still two orders of magnitude more expensive than any other SaaS business. > Hopefully you're using LLM to deliver value that's worth more than a penny per hour of the people using it. Maybe, but then again you're trying to build a service that has to add much more value than what the typical SaaS start-up provides. Also regarding this: > - Most apps are not non-stop token generation for concurrent users-- ChatGPT's duty cycle at this is very low. ChatGPT is mostly being used by people who use it a few minutes per day, which is a nice place to be, but: - this market is already taken by them, so your startup isn't gonna do the same. - when you start integrating LLMs in tools you use routinely (an IDE being the typical example, then the token generation amount skyrockets). |
Really? Some SaaS businesses have users doing things that generate tens of thousands of IOs per user request across spinning storage, or even far more.
> ChatGPT is mostly being used by people who use it a few minutes per day, which is a nice place to be, but:
I think you basically completely misunderstood everything I said. Here, the point was that someone using it is generating tokens a very large proportion of the time they're sitting in front of the service compared to most use cases-- but it's still only like 20% of the time.
We all have a pretty good understanding of the tradeoffs between owning hardware vs. elastic usage of a utility. We know that "peek usage" [sic] is higher than average (which is why there's a duty cycle correction in the calculation in the first place).
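For anyone following along, the back-of-envelope arithmetic being argued over can be sketched roughly like this. Only the 4-year amortization, the 20 cents per working hour, the ~20% duty cycle, and the "roughly 3 times" figure come from the thread. The $1,600 price, 250 working days of 8 hours, and 8 batched streams per GPU are assumptions picked for illustration, and the function names are hypothetical.

```python
def gpu_cost_per_working_hour(price_usd, years=4, working_days_per_year=250, hours_per_day=8):
    """Amortize a purchase price over working hours only (assumed calendar)."""
    working_hours = years * working_days_per_year * hours_per_day
    return price_usd / working_hours


def cost_per_user_hour(gpu_cost_per_hour, streams_per_gpu, duty_cycle, peak_factor=1.0):
    """Cost per hour for each person sitting in front of the service.

    duty_cycle: fraction of a session spent actually generating tokens.
    peak_factor: extra capacity provisioned for peak load
    (1.0 means provisioning for average load only).
    """
    # If each user only generates tokens duty_cycle of the time, one GPU
    # running streams_per_gpu streams covers this many seated users.
    users_per_gpu = streams_per_gpu / duty_cycle
    return gpu_cost_per_hour * peak_factor / users_per_gpu


# Assumed inputs: $1,600 card, 8 batched streams per GPU, 20% duty cycle.
gpu_hourly = gpu_cost_per_working_hour(1600)  # about 0.20 USD per working hour
per_user = cost_per_user_hour(gpu_hourly, streams_per_gpu=8, duty_cycle=0.2)
per_user_peak = cost_per_user_hour(gpu_hourly, streams_per_gpu=8, duty_cycle=0.2, peak_factor=3.0)
print(f"{gpu_hourly:.2f} {per_user:.4f} {per_user_peak:.4f}")
```

Under those assumed inputs the per-user figure stays near a penny per hour even with a 3x peak provisioning factor; the result is only as good as the stream count and duty cycle you plug in.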
> - when you start integrating LLMs in tools you use routinely (an IDE being the typical example, then the token generation amount skyrockets).
It all depends. The system I just built and deployed does not need to be immediately responsive to end-users (users can tolerate a delay of a couple of minutes), with a few thousand tokens per user per week, and usage smeared pretty well over a several-hour-per-day window. There are a lot of reasons (beyond economics) why moving it to a consumer GPU is attractive, but it won't be happy with a 1B parameter model.
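To put a rough number on a workload like that, here is a minimal sketch of the average generation rate it implies. The user count, the 3,000 tokens (standing in for "a few thousand"), the 6-hour window, and the 5-day week are all made-up illustrative values, not figures from the deployed system.

```python
def average_tokens_per_second(users, tokens_per_user_per_week, window_hours_per_day, days_per_week=5):
    """Average token rate if weekly usage is spread evenly over the daily window."""
    window_seconds_per_week = window_hours_per_day * 3600 * days_per_week
    return users * tokens_per_user_per_week / window_seconds_per_week


# Hypothetical workload: 1,000 users, 3,000 tokens each per week, 6-hour daily window.
rate = average_tokens_per_second(users=1000, tokens_per_user_per_week=3000, window_hours_per_day=6)
print(f"{rate:.1f} tokens/s on average")
```

Because users can wait a couple of minutes, requests can be queued and batched, so the hardware only has to keep up with something close to this average rather than with the instantaneous peak.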