by jamesblonde
3120 days ago
Very good analysis, and a correct conclusion that memory bandwidth is the bottleneck (at least for workloads dominated by matrix fused multiply-add, like feedforward NNs and convnets). We have run experiments on the 1080 Ti (484 GB/s), and for 32-bit FP training (convnets on TensorFlow) it is close in performance to the P100 (717 GB/s). The other point to add is that SIMD execution on GPUs is what gives them efficient batched reads from GPU memory for each operation.
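To make the bandwidth argument concrete, here is a rough roofline-style sketch in Python. It is a back-of-the-envelope model, not a measurement: the function names are made up for illustration, the byte count assumes each matrix is moved to or from memory exactly once (ideal reuse), and the 10 TFLOP/s peak is an assumed round number rather than a spec for either card. Only the 484 GB/s figure comes from the numbers above.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Assumes ideal reuse: A and B are each read once and C is written once.
    bytes_per_elem=4 corresponds to 32-bit floats.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


def is_bandwidth_bound(intensity, peak_flops, bandwidth_bytes_per_s):
    """True when the memory roof is lower than the compute roof."""
    return intensity * bandwidth_bytes_per_s < peak_flops


ASSUMED_PEAK_FLOPS = 10e12   # illustrative round number, NOT a vendor spec
BANDWIDTH_1080TI = 484e9     # bytes/s, the figure quoted in this comment

# A fully connected layer with batch size 1 is a matrix-vector product.
matvec = matmul_arithmetic_intensity(4096, 1, 4096)
print("matvec intensity (FLOPs/byte):", matvec)
print("bandwidth bound:", is_bandwidth_bound(matvec, ASSUMED_PEAK_FLOPS, BANDWIDTH_1080TI))
print("attainable FLOP/s:", attainable_flops(matvec, ASSUMED_PEAK_FLOPS, BANDWIDTH_1080TI))
```

Under these assumptions a batch-size-1 layer does less than one FLOP per byte moved, so the memory roof is what limits it. Real kernels rarely hit ideal reuse, so measured intensity is usually lower than this estimate.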
|
I can't say I'm an expert yet. But the more I read about highly optimized code on any platform, the more I realize that 90% of the problem is dealing with memory.
Virtually every optimization guide or tutorial on highly optimized code spends an enormous amount of time discussing memory problems. It seems like memory bandwidth is the single thing that HPC coders think about the most.