| My guess is they are using tensor cores, since they report FP16 throughput, but they seem to be measuring at batch size 1, which is hugely unfair to the GPUs. For inference workloads you usually batch incoming requests together and run them on the GPU in one pass (though this increases latency). A latency/throughput tradeoff curve at different batch sizes would tell the whole story.
Also, they are using INT8 on CPU and neglect to measure the same on GPU. All the GPU throughputs would roughly double.
tl;dr just use GPUs.
Edit: to the comments below, I agree low latency can be important to some workloads, but that's exactly why I think we need to see a latency-throughput tradeoff curve. Unfortunately, I'm pretty sure that modern GPUs (A10/A30/A40/A100) basically dominate CPUs even when latency is constrained, and the MLPerf results give a good (fair!) comparison of this: https://developer.nvidia.com/blog/extending-nvidia-performan... The GPU throughputs are much, much higher than the CPU ones, and I don't think even NM's software can overcome this gap. Not to mention they degrade the model quality...
The last question is whether CPUs are more cost-effective despite being slower, and the answer is still... no. The instances used in this blog post cost:
- C5 CPU (c5.12xlarge): $2.04/hr
- T4 GPU (g4dn.2xlarge): $0.752/hr
NM's best result at batch size 1 costs more, at lower throughput, at lower model quality, and at about the same latency, than a last-gen GPU operating at half its capacity. A new A10 GPU using INT8 will widen the perf/$ gap by another ~4x.
Also, full disclosure: I don't work at NVIDIA or anything like that, so I'm not trying to shill :) I just like ML hardware a lot and want to help people make fair comparisons. |
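The perf/$ argument in the comment above comes down to simple arithmetic, which can be sketched roughly as follows. The two hourly prices are the ones quoted in the comment; the function names and the example throughput are hypothetical placeholders, not measured results, so plug in your own benchmark numbers.

```python
# Hourly on-demand prices quoted in the comment above (USD per hour).
PRICE_PER_HOUR = {
    "c5.12xlarge": 2.04,    # C5 CPU instance
    "g4dn.2xlarge": 0.752,  # T4 GPU instance
}


def cost_per_million(price_per_hour: float, items_per_second: float) -> float:
    """Return the USD cost of processing one million items at a steady rate."""
    if items_per_second <= 0:
        raise ValueError("items_per_second must be positive")
    items_per_hour = items_per_second * 3600
    return price_per_hour / items_per_hour * 1_000_000


def break_even_speedup(cpu_price: float, gpu_price: float) -> float:
    """How many times faster the CPU instance must be than the GPU instance
    for both to cost the same per item."""
    return cpu_price / gpu_price


if __name__ == "__main__":
    # HYPOTHETICAL throughput (items/second), used only to show the arithmetic.
    example_rate = 100.0
    for name, price in PRICE_PER_HOUR.items():
        cost = cost_per_million(price, example_rate)
        print(f"{name}: ${cost:.2f} per million items at {example_rate} items/s")
    ratio = break_even_speedup(
        PRICE_PER_HOUR["c5.12xlarge"], PRICE_PER_HOUR["g4dn.2xlarge"]
    )
    print(f"CPU instance must be {ratio:.2f}x faster to break even")
```

At these prices the CPU instance would need roughly 2.7x the GPU instance's throughput just to match it on cost per item, before latency or model quality enter the picture.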
Hi ml_hardware, we report results for both throughput and latency in the blog. As you noted, the throughput performance for GPUs does beat out our implementations by a bit, but we did improve the throughput performance on CPUs by over 10x. Our goal is to enable better flexibility and performance for deployments through more commonly available CPU servers.
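To make the latency/throughput discussion concrete, here is a minimal sketch of the kind of tradeoff curve the parent comment asks for. It is a toy model, not a benchmark: it assumes each batch pays a fixed overhead plus a per-item cost, ignores the time requests spend waiting for a batch to fill, and every name and constant in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BatchPoint:
    batch_size: int
    latency_ms: float   # time for one batch to complete
    throughput: float   # items per second


def tradeoff_curve(fixed_ms: float, per_item_ms: float, batch_sizes):
    """Toy model: each batch pays a fixed overhead plus a per-item cost."""
    points = []
    for b in batch_sizes:
        latency_ms = fixed_ms + per_item_ms * b
        throughput = b / (latency_ms / 1000.0)
        points.append(BatchPoint(b, latency_ms, throughput))
    return points


def best_under_latency(points, budget_ms: float):
    """Highest-throughput point whose latency fits the budget, or None."""
    ok = [p for p in points if p.latency_ms <= budget_ms]
    return max(ok, key=lambda p: p.throughput) if ok else None


if __name__ == "__main__":
    # HYPOTHETICAL constants: 5 ms fixed overhead, 1 ms per item.
    curve = tradeoff_curve(5.0, 1.0, [1, 2, 4, 8, 16, 32, 64])
    for p in curve:
        print(f"batch={p.batch_size:3d}  latency={p.latency_ms:6.1f} ms  "
              f"throughput={p.throughput:7.1f} items/s")
    print("best under 20 ms:", best_under_latency(curve, 20.0))
```

In this model throughput keeps rising with batch size while latency grows linearly, so a fair comparison would report the best throughput each device reaches under the same latency budget.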
For throughput costs, this flexibility becomes essential. A user could scale down to as little as one core if they wanted to, which gives a much larger improvement in cost performance. We walk through these comparisons in more depth in our YOLOv3 blog: https://neuralmagic.com/blog/benchmark-yolov3-on-cpus-with-d...
INT8 wasn't run on GPUs because we ran into operator-support issues when converting PyTorch graphs to TensorRT (PyTorch currently doesn't support INT8 on GPU). We are actively working on this, though, so stay tuned as we run those comparisons!
The models we're shipping will see performance gains on the A100s as well, due to their support for semi-structured sparsity. Note, though, that A100s are priced higher than the commonly available V100s and T4s, which needs to be taken into account. We generally limit our current benchmarks to what is available in the top cloud services, to represent what most people can actually deploy on servers.
This usability is why we don't consider MLPerf a great source for most users. MLPerf has done a great job of standardizing benchmarking and improving numbers across the industry. Still, the systems submitted are hyper-engineered for MLPerf numbers, and most customers cannot realize these numbers because of the cost involved.
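For readers unfamiliar with the semi-structured sparsity mentioned for the A100, here is a small illustration. It assumes the term refers to the 2:4 pattern used by NVIDIA's Ampere sparse tensor cores (at most two nonzero values in every group of four consecutive weights). The function name and the magnitude-based selection are illustrative assumptions, not a description of how the models discussed here were actually pruned.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each consecutive group of four.

    Illustrative only: real pruning pipelines typically retrain or fine-tune
    afterwards to recover accuracy.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


if __name__ == "__main__":
    row = [0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.0]
    print(prune_2_of_4(row))
```

Because exactly half the entries in every group are zero, the hardware can skip them with a fixed, predictable layout, which is where the speedup on sparse-capable GPUs is expected to come from.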
Finally, note that the post-processing for these networks is currently limited to CPUs due to operator support. This limitation will become a bottleneck for most deployments (it already is, for both GPUs and us, in the YOLOv5s numbers). We are actively working on speeding up the post-processing by leveraging the CPU cache hierarchy through the DeepSparse engine, and are seeing promising early results. We'll be releasing those soon to show even better comparisons.