Hacker News
by nmfisher 1996 days ago
From the about section:

> How much does maintaining the servers cost?
>
> It depends on the amount of traffic, but the minimum baseline is around several thousands of US dollars every month. This is expected as inference is very GPU intensive and a sufficient number of instances need to be spun up to handle thousands of requests coming in every minute. Everything is paid out of pocket.

Wow, impressive commitment for something that's free.

4 comments

The price of GPU inference can be brutal, but there's a lot you can do on the infra side to improve it:

- Spot instances

- Aggressive autoscaling

- Micro batching

Together, these can reduce inference compute spend by huge amounts (90% is not uncommon). ML, especially anything involving realtime inference, is an area where effective platform engineering makes a ridiculous difference even in the earliest days.
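To make the micro batching item concrete, here is a minimal sketch in Python. It is not how any particular serving framework does it: `fake_model`, `micro_batch`, and `run_batched` are hypothetical names made up for illustration, and a real server would typically also flush a batch after a short time window rather than only by size.

```python
from typing import Callable, List, Sequence, Tuple


def micro_batch(requests: Sequence[str], max_batch_size: int) -> List[List[str]]:
    """Split pending requests into batches of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        list(requests[i:i + max_batch_size])
        for i in range(0, len(requests), max_batch_size)
    ]


def run_batched(
    requests: Sequence[str],
    model_fn: Callable[[List[str]], List[str]],
    max_batch_size: int,
) -> Tuple[List[str], int]:
    """Run model_fn once per batch; return all outputs and the number of model calls."""
    outputs: List[str] = []
    calls = 0
    for batch in micro_batch(requests, max_batch_size):
        outputs.extend(model_fn(batch))  # one forward pass serves the whole batch
        calls += 1
    return outputs, calls


def fake_model(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for a GPU model that accepts a batch of inputs."""
    return [text.upper() for text in batch]


if __name__ == "__main__":
    outputs, calls = run_batched(["a", "b", "c", "d", "e"], fake_model, max_batch_size=2)
    print(outputs, calls)
```

The point is only that five requests turn into three model calls instead of five; on a GPU, a batched call usually costs far less than the same number of individual calls.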

Source: I help maintain open source ML infra for GPU inference and think about compute spend way too much https://github.com/cortexlabs/cortex

Yeah, running anything related to AI involves GPU instances. An alternative is to point people to using Google Colab where you can get access to a GPU for free, but that's not a smooth end user experience for most folks.
> running anything related to AI involves GPU instances

This is not true. A _lot_ of AI applications use algorithms such as logistic regression or random forests and don’t need GPUs - partly, of course, because GPUs are so expensive and these approaches are good enough (or more than good enough) for many applications.

Whoops, sloppy generalization on my part. You're completely right of course, thanks! I've been focusing on deep learning a lot lately, to the point where AI has become an alias for those exciting new GPU-heavy techniques.
Out of curiosity, as I have no visibility into the infra actually required: at that cost, would it not be easier to just have a machine under a desk somewhere?
Not for the kind of inference running here, I'd imagine.

There are a few key reasons why most realtime inference is done in the cloud:

- Scale. Deep learning models tend to have poor latency, especially as they grow in size. As a result, you need to scale up replicas to meet demand at a way lower level of traffic than you do for a normal web app. At one point, AI Dungeon needed over 700 servers to support just thousands of concurrent players.

- Cost. Related to the above, GPUs are really expensive to buy. A g4dn.xlarge instance (the most popular AWS EC2 instance for GPU inference) is $0.526/hour on demand. To hit $3,000 per month in spend, you'd need to be running ~8 of them 24/7. Prices for buying GPUs vary, but you could expect 8 NVIDIA T4s to run around $20,000 at minimum, plus the cost of other components and maintenance. To be clear, that's very conservative--it's unlikely you'll get consistent traffic. What's more likely is you'll have some periods of very little traffic where you need one or two GPUs, and other high load periods where you'll need 10+.

- Less universal of an issue, but the cloud gives you much better access to chips at lower switching costs. If NVIDIA releases a new GPU that's even better for inference, switching to it (once it's available on your cloud) will be a tweak in your YAML. If you ever switch to ASICs like AWS's Inferentia or GCP's TPUs, which in many cases give way better performance and economics than GPUs, you'll also naturally have to be on their cloud.
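The arithmetic in the cost point can be checked with a few lines of Python. This is a rough sketch: the $0.526/hour price, the $3,000 budget, and the $20,000 hardware figure come from the comment above, while the 730 hours per month (365 × 24 / 12) and the function names are assumptions for illustration. The break-even figure ignores power, other components, and maintenance, so treat it as a lower bound.

```python
import math

ON_DEMAND_HOURLY_USD = 0.526   # g4dn.xlarge on-demand price quoted in the comment
HOURS_PER_MONTH = 730          # assumption: 365 * 24 / 12
HARDWARE_COST_USD = 20_000     # rough price of 8 T4s quoted in the comment


def monthly_cost(instances: int, hourly_usd: float = ON_DEMAND_HOURLY_USD) -> float:
    """Cost of running `instances` machines 24/7 for one month."""
    return instances * hourly_usd * HOURS_PER_MONTH


def instances_for_budget(budget_usd: float) -> float:
    """How many always-on instances a monthly budget pays for (fractional)."""
    return budget_usd / monthly_cost(1)


def break_even_months(hardware_usd: float, instances: int) -> float:
    """Months of cloud spend that equal the up-front GPU purchase (GPUs only)."""
    return hardware_usd / monthly_cost(instances)


if __name__ == "__main__":
    print(f"one instance: ${monthly_cost(1):.2f}/month")
    print(f"instances for $3,000: {instances_for_budget(3000):.2f}")
    print(f"eight instances: ${monthly_cost(8):.2f}/month")
    print(f"break-even vs. buying: {break_even_months(HARDWARE_COST_USD, 8):.1f} months")
```

One instance comes to roughly $384 per month, so $3,000 covers a little under 8 of them, which matches the "~8" in the comment.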

However, there is a lot that can be done to lower the cost of inference even in the cloud. I listed some things in a comment higher up, but basically, there are some assumptions you can make with inference that allow you to optimize pretty hard on instance price and autoscaling behavior.
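As a rough illustration of why autoscaling matters for this kind of workload, the sketch below compares provisioning for peak load all day against scaling with demand. The traffic profile is entirely hypothetical (18 hours a day needing 2 GPUs, 6 hours needing 10), chosen to resemble the "one or two GPUs" versus "10+" pattern described above. The `spot_discount` parameter is also an assumption, not a quoted price. Only the $0.526/hour figure comes from the thread.

```python
from typing import Dict

ON_DEMAND_HOURLY_USD = 0.526  # g4dn.xlarge on-demand price quoted in the thread

# Hypothetical daily profile: {GPUs needed: hours per day at that level}
HYPOTHETICAL_PROFILE: Dict[int, int] = {2: 18, 10: 6}


def gpu_hours_fixed(profile: Dict[int, int]) -> int:
    """GPU-hours per day if you always run enough GPUs for the peak."""
    peak_gpus = max(profile)
    return peak_gpus * sum(profile.values())


def gpu_hours_autoscaled(profile: Dict[int, int]) -> int:
    """GPU-hours per day if capacity tracks demand exactly (idealized)."""
    return sum(gpus * hours for gpus, hours in profile.items())


def daily_cost(gpu_hours: float, spot_discount: float = 0.0) -> float:
    """Daily cost in USD; spot_discount is a hypothetical fraction off on-demand."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return gpu_hours * ON_DEMAND_HOURLY_USD * (1.0 - spot_discount)


def savings_fraction(profile: Dict[int, int]) -> float:
    """Fraction of GPU-hours saved by autoscaling versus fixed peak provisioning."""
    return 1.0 - gpu_hours_autoscaled(profile) / gpu_hours_fixed(profile)


if __name__ == "__main__":
    print("fixed GPU-hours/day:", gpu_hours_fixed(HYPOTHETICAL_PROFILE))
    print("autoscaled GPU-hours/day:", gpu_hours_autoscaled(HYPOTHETICAL_PROFILE))
    print(f"savings from autoscaling alone: {savings_fraction(HYPOTHETICAL_PROFILE):.0%}")
```

Real autoscalers lag behind demand and keep some headroom, so actual savings would be lower than this idealized number; the model only shows the shape of the effect.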

You just sort of assume that this is correct? The person[1] running this comes across as a severely unstable character; that number is probably hyperbole.

[1] https://twitter.com/fifteenai

Not a hyperbole – I can provide proof if you'd like.
Separate question - is this English only? It looks like you can feed in phonemes but I assume this has been trained with English audio.
Would you be willing to explain how you can justify offering this for free? I’ve subbed to the Patreon, but that’s less than a drop in the bucket compared to the ~$10k you say this month will cost.
I’ve worked with deep learning models enough to know the cost of running GPU inference, and if the live queue stats published on the website are accurate, then thousands of dollars per month is certainly plausible.

I have no reason to disbelieve it.

It seems like one could get to those numbers pretty easily given the prices for GPU instances on AWS. Even just one decent-sized instance would be thousands of dollars per month.