Hacker News
by calebkaiser 2165 days ago
I maintain an open source ML infra project, where we've spent a ton of time on cost optimization for running GPU-intensive ML models, specifically on AWS: https://github.com/cortexlabs/cortex

If you've done zero optimization so far, there is likely some real low-hanging fruit:

1. If GPU instances are running up a huge EC2 bill, switch to spot instances (a g4dn.xlarge spot is $0.1578/hr in US West (Oregon) vs $0.526/hr on demand).

2. If inference costs are high, look into Inferentia ( https://docs.cortex.dev/deployments/inferentia ). For certain models, we've benchmarked over 4x improvements in efficiency. Additionally, autoscaling more conservatively and leveraging batch prediction wherever possible can make a real dent.

3. Finally, and likely the lowest-hanging fruit of all, talk to your AWS rep. If your situation is dire, there's a very good chance they'll throw some credits your way while you figure things out.
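
To put the spot numbers from point 1 in perspective, here is a rough back-of-the-envelope sketch in Python. The two hourly prices are the ones quoted above; the 730 hours/month figure and the assumption of a single instance running around the clock are mine, purely for illustration. Spot prices fluctuate and spot instances can be interrupted, so treat the output as an estimate rather than a quote.

```python
# Hourly prices quoted above for a g4dn.xlarge in US West (Oregon).
SPOT_PER_HOUR = 0.1578
ON_DEMAND_PER_HOUR = 0.526

# Assumption (not from the comment): one instance running 24/7,
# using the common ~730 hours/month approximation.
HOURS_PER_MONTH = 730


def spot_savings_fraction(spot: float, on_demand: float) -> float:
    """Fraction of the on-demand price saved by paying the spot price."""
    return 1 - spot / on_demand


def monthly_cost(hourly_price: float, hours: float = HOURS_PER_MONTH,
                 instances: int = 1) -> float:
    """Estimated monthly cost for a fleet billed at a flat hourly price."""
    return hourly_price * hours * instances


savings = spot_savings_fraction(SPOT_PER_HOUR, ON_DEMAND_PER_HOUR)
on_demand_monthly = monthly_cost(ON_DEMAND_PER_HOUR)
spot_monthly = monthly_cost(SPOT_PER_HOUR)

print(f"spot discount:      {savings:.0%}")
print(f"on-demand, 1 month: ${on_demand_monthly:,.2f}")
print(f"spot, 1 month:      ${spot_monthly:,.2f}")
```

At the quoted prices the spot rate works out to roughly 30% of on-demand, so the same fleet costs about 70% less, before accounting for any extra capacity you keep around to absorb interruptions.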

If you're interested in trying Cortex out, AI Dungeon wrote a piece on how they used it to bring their spend down ~90%. For context, they serve a 5 GB GPT-2 model to thousands of players every day: https://medium.com/@aidungeon/how-we-scaled-ai-dungeon-2-to-...