Hacker News new | ask | show | jobs
by calebkaiser 1996 days ago
The price of GPU inference can be brutal, but there's a lot you can do on the infra side to improve it:

- Spot instances

- Aggressive autoscaling

- Micro batching

Can reduce inference compute spend by huge amounts (90% is not uncommon). ML, especially anything involving realtime inference, is an area where effective platform engineering makes a ridiculous difference even in the earliest days.

Source: I help maintain open source ML infra for GPU inference and think about compute spend way too much https://github.com/cortexlabs/cortex