by 37ef_ced3
1739 days ago
Batch size 1 improves latency, especially for businesses/services with fewer users, and latency matters. Also, your CPU cost numbers are way off because you priced an expensive provider like AWS instead of, say, Vultr (https://www.vultr.com). And many businesses/services can't saturate the hardware you describe; it's just too much compute power. With CPUs you can scale down to fit your actual needs: all the way down to a single AVX-512 core doing maybe 24 inferences per second (costing a few dollars PER MONTH).
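To put rough numbers on the scale-down point, here is a back-of-the-envelope sketch. The 24 inferences per second is the figure from this comment; the $5/month price and the utilization values are placeholders assumed purely for illustration, not quotes from any provider, and the function names are made up for this example.

```python
# Back-of-the-envelope cost model for one machine billed by the month.
# Only the 24 inferences/s figure comes from the comment; the price and
# utilization values passed in below are illustrative assumptions.

SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def monthly_capacity(inferences_per_sec: float, utilization: float = 1.0) -> float:
    """Inferences actually served per month at the given utilization (0 < u <= 1)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return inferences_per_sec * utilization * SECONDS_PER_MONTH


def cost_per_million(monthly_usd: float, inferences_per_sec: float,
                     utilization: float = 1.0) -> float:
    """USD per one million inferences served."""
    return monthly_usd / monthly_capacity(inferences_per_sec, utilization) * 1_000_000


if __name__ == "__main__":
    price = 5.0  # hypothetical monthly price for a single core, in USD
    for u in (1.0, 0.1, 0.01):
        print(f"utilization {u:>5.0%}: "
              f"{monthly_capacity(24, u):>12,.0f} inferences/month, "
              f"${cost_per_million(price, 24, u):.2f} per million")
```

The point of the utilization parameter is that the monthly bill is fixed, so cost per inference rises as traffic falls; the same calculation applies to a GPU instance with its own price and throughput plugged in, which is where the "can't saturate the hardware" argument shows up.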
|
I can agree with you that on super, super small inference deployments, maybe you can lower monthly spend by using CPUs. But I must ask: who is the target customer that is both spending <$100/month and also trying to optimize this? I feel like big players will have big workloads that will be most cost-effective on GPUs.