| HN Mirror



	by namibj 2951 days ago
	Be careful what they benchmark against. The Maxwell/Pascal GEMM routines are _very_ optimized: Nervana Systems (in their infancy) wrote an assembler for these architectures and showed it off by implementing GEMM (single-precision dense matrix*matrix multiplication). Their version came within about 2% of the theoretical performance predicted by the clock rate, though they had no idea why that last 2% was missing. They hand-scheduled the instructions to get around the limited number of register banks and the associated bank conflicts, a concept normal CPUs afaik only see with DRAM. At that time, the official GEMM, albeit hand-optimized, only reached about 80% of the theoretical value.
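For anyone wanting to redo the "percent of theoretical" arithmetic, here is a minimal sketch of how such a figure is usually derived. It assumes the common convention that a dense GEMM of shape M x N x K costs 2*M*N*K floating-point operations and that each core retires one fused multiply-add (2 FLOPs) per cycle. The function names and the device parameters are hypothetical, not taken from Nervana's write-up or from any vendor spec sheet.

```python
def peak_flops(cuda_cores, clock_hz, flops_per_core_per_cycle=2):
    """Theoretical peak FLOP/s predicted by the clock rate.

    Assumes one FMA (counted as 2 FLOPs) per core per cycle by default.
    """
    return cuda_cores * clock_hz * flops_per_core_per_cycle


def gemm_flops(m, n, k):
    """FLOP count of a dense (m x k) * (k x n) multiplication."""
    return 2 * m * n * k


def gemm_efficiency(m, n, k, seconds, cuda_cores, clock_hz):
    """Fraction of theoretical peak reached by one timed GEMM call."""
    achieved = gemm_flops(m, n, k) / seconds
    return achieved / peak_flops(cuda_cores, clock_hz)


if __name__ == "__main__":
    # Hypothetical device: 2048 cores at 1.0 GHz (illustrative only).
    cores, clock = 2048, 1.0e9
    # Hypothetical measured time for a 4096^3 SGEMM; replace with a real timing.
    measured_seconds = 0.035
    eff = gemm_efficiency(4096, 4096, 4096, measured_seconds, cores, clock)
    print(f"efficiency: {eff:.1%}")
```

The clock rate passed in matters: boost and base clocks give different "theoretical" values, so an efficiency figure is only comparable if the same clock is assumed on both sides.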

1 comments

crowwork 2951 days ago

The benchmarks are not about GEMM but about real-world deep learning workloads, which can have very different characteristics from GEMM.


namibj 2951 days ago

I just wanted to caution that one has to be careful what one is comparing against, as the libraries got significant speed improvements over time without that being widely advertised. So it matters a lot whether one compares this library against CuDNN from 1 month ago or against CuDNN from 2 years ago. The latter is _much_ slower.

I only used the GEMM example because the details of its optimization have been published, unlike those of most other hand-tuned assembler routines for DNN workloads.


antinucleon 2951 days ago

CuDNN v7 was used in the experiments; in the experiments section, each comparison is listed with its version or commit number.


namibj 2951 days ago

Well, I didn't RTFA. This was not meant to be specific to this article though.
