by namibj
2951 days ago
Be careful what they benchmark against. The Maxwell/Pascal GEMM routines are _very_ well optimized: Nervana Systems (in their early days) wrote an assembler for those GPUs and demonstrated it by implementing GEMM (single-precision dense matrix-matrix multiplication). It came within about 2% of the theoretical peak predicted by the clock rate, though they had no idea where that last 2% went. They hand-scheduled the instructions to work around the limited number of register banks and the resulting bank conflicts, a concept normal CPUs afaik only run into with DRAM. At the time, the official GEMM, albeit hand-optimized, only reached about 80% of the theoretical peak.
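To make "theoretical performance, as predicted by the clock rate" concrete, here is a rough sketch of how you might compute that fraction yourself. It assumes each CUDA core retires one fused multiply-add (2 flops) per cycle and counts a GEMM as 2*M*N*K flops. The function names and every number in the example are illustrative placeholders, not measurements from Nervana or NVIDIA.

```python
def peak_fp32_flops(cuda_cores: int, clock_hz: float) -> float:
    """Theoretical single-precision peak, assuming 1 FMA (2 flops) per core per cycle."""
    return 2.0 * cuda_cores * clock_hz


def gemm_flops(m: int, n: int, k: int) -> float:
    """Flop count of C = A @ B with A of shape (m, k) and B of shape (k, n)."""
    return 2.0 * m * n * k


def gemm_efficiency(m: int, n: int, k: int, seconds: float,
                    cuda_cores: int, clock_hz: float) -> float:
    """Achieved throughput as a fraction of the theoretical peak."""
    achieved = gemm_flops(m, n, k) / seconds
    return achieved / peak_fp32_flops(cuda_cores, clock_hz)


if __name__ == "__main__":
    # Hypothetical GPU: 2048 cores at 1.0 GHz (placeholder values).
    cores, clock = 2048, 1.0e9
    # Hypothetical timing for a 4096^3 GEMM; substitute your own measurement.
    elapsed = 0.035
    frac = gemm_efficiency(4096, 4096, 4096, elapsed, cores, clock)
    print(f"{frac:.1%} of theoretical peak")
```

When comparing a benchmark against a GEMM baseline, feed in the clock the GPU actually sustained during the run, since boost clocks move the denominator.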