by 37ef_ced3 1739 days ago
Batch size 1 improves latency, especially for businesses/services with fewer users. Latency matters.

Also, your CPU cost numbers are way off because you priced an expensive provider like AWS instead of, say, Vultr (https://www.vultr.com).

And many businesses/services can't saturate the hardware you describe. It's just too much compute power. With CPUs you can scale down to fit your actual needs: all the way down to a single AVX-512 core doing maybe 24 inferences per second (costing a few dollars PER MONTH).
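To put rough numbers on that single-core case, here is a back-of-the-envelope sketch. The 24 inferences per second comes from the figure above; the $5/month price is only a stand-in for "a few dollars", and the function names are made up for illustration, so treat the output as an estimate, not a quote from any provider.

```python
# Rough capacity/cost arithmetic for one always-on CPU core.
# ASSUMPTIONS: a 30-day month, and a hypothetical $5/month price
# standing in for "a few dollars per month".

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def monthly_capacity(inferences_per_second: float) -> int:
    """Inferences one core can serve in a month if it is never idle."""
    return int(inferences_per_second * SECONDS_PER_MONTH)


def cost_per_million(monthly_cost_usd: float,
                     inferences_per_second: float,
                     utilization: float = 1.0) -> float:
    """Dollars per million inferences at a given utilization (0 < u <= 1)."""
    served = inferences_per_second * SECONDS_PER_MONTH * utilization
    return monthly_cost_usd / served * 1_000_000


if __name__ == "__main__":
    print(monthly_capacity(24))              # inferences per month at full load
    print(cost_per_million(5.0, 24))         # fully utilized core
    print(cost_per_million(5.0, 24, 0.10))   # same core at 10% utilization
```

Even at low utilization the per-inference cost stays small, which is the point about scaling down to fit actual needs: a small service pays for one core, not for hardware it can't saturate.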

1 comment

I was providing costs for the exact instance types that NeuralMagic used in their blog post. If we’re allowed to change that, then I can also find cheaper GPU providers.

I can agree with you that on super, super small inference deployments, maybe you can lower monthly spend by using CPUs. But I must ask: who is the target customer that is both spending <$100 / month and also trying to optimize this? I feel like big players will have big workloads that will be most cost-effective on GPUs.