

	by Const-me 590 days ago
	A couple of years ago I wanted about the same thing in HLSL, for a Direct3D 11.0 compute shader. Here’s the fastest version I managed to make back then: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral... As you can see, I implemented 32×32 tiling using thread groups of 32×8 threads, with two groupshared buffers that hold tiles of the input matrices. Each thread accumulates into local variables, 32 / 8 = 4 accumulators per thread.
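
For readers who want to see the tiling scheme spelled out, here is a rough CPU-side sketch in C++. It is not the linked shader. It only emulates the layout described above: 32×32 tiles, a 32×8 "thread group" simulated with loops, two tile buffers standing in for groupshared memory, and 4 accumulators per simulated thread. The function name `tiledMatMul`, the row-major layout, and the requirement that all dimensions are multiples of 32 are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILE = 32;             // tile edge, 32x32 as in the post
constexpr int GROUP_X = 32;          // simulated threads per group along X
constexpr int GROUP_Y = 8;           // simulated threads per group along Y
constexpr int ACC = TILE / GROUP_Y;  // 32 / 8 = 4 accumulators per thread

// Computes C (m x n) = A (m x k) * B (k x n), all row-major.
// Assumption for this sketch: m, k and n are multiples of TILE.
std::vector<float> tiledMatMul(const std::vector<float>& a,
                               const std::vector<float>& b,
                               int m, int k, int n)
{
    assert(m % TILE == 0 && k % TILE == 0 && n % TILE == 0);
    assert(a.size() == static_cast<std::size_t>(m) * k);
    assert(b.size() == static_cast<std::size_t>(k) * n);

    std::vector<float> c(static_cast<std::size_t>(m) * n, 0.0f);
    float tileA[TILE][TILE];  // stand-in for the first groupshared buffer
    float tileB[TILE][TILE];  // stand-in for the second groupshared buffer

    for (int gy = 0; gy < m / TILE; ++gy) {
        for (int gx = 0; gx < n / TILE; ++gx) {
            // Per-thread accumulators; on a GPU these would be local variables.
            float acc[GROUP_Y][GROUP_X][ACC] = {};

            for (int t = 0; t < k; t += TILE) {
                // Cooperative load: each simulated thread copies ACC elements per tile.
                for (int ty = 0; ty < GROUP_Y; ++ty)
                    for (int tx = 0; tx < GROUP_X; ++tx)
                        for (int i = 0; i < ACC; ++i) {
                            const int row = ty + i * GROUP_Y;
                            tileA[row][tx] = a[static_cast<std::size_t>(gy * TILE + row) * k + t + tx];
                            tileB[row][tx] = b[static_cast<std::size_t>(t + row) * n + gx * TILE + tx];
                        }
                // A real shader needs a group barrier here (GroupMemoryBarrierWithGroupSync in HLSL).

                // Each simulated thread updates its ACC output elements from the two tiles.
                for (int ty = 0; ty < GROUP_Y; ++ty)
                    for (int tx = 0; tx < GROUP_X; ++tx)
                        for (int i = 0; i < ACC; ++i) {
                            const int row = ty + i * GROUP_Y;
                            float sum = acc[ty][tx][i];
                            for (int j = 0; j < TILE; ++j)
                                sum += tileA[row][j] * tileB[j][tx];
                            acc[ty][tx][i] = sum;
                        }
                // A real shader needs a second barrier here, before the tiles are overwritten.
            }

            // Write the accumulators to the output tile.
            for (int ty = 0; ty < GROUP_Y; ++ty)
                for (int tx = 0; tx < GROUP_X; ++tx)
                    for (int i = 0; i < ACC; ++i) {
                        const int row = ty + i * GROUP_Y;
                        c[static_cast<std::size_t>(gy * TILE + row) * n + gx * TILE + tx] = acc[ty][tx][i];
                    }
        }
    }
    return c;
}
```

The point of the 32×8 group shape is that every thread owns 4 rows of one output column, so each value loaded into the tile buffers is reused across several accumulators instead of being fetched from global memory again.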


lostmsu 590 days ago

What's the perf like?


Const-me 590 days ago

Sorry, I have not benchmarked it against cuBLAS, Eigen, or similar libraries. I wrote it for ML inference.

I have implemented a profiler on top of D3D11_QUERY_TIMESTAMP and D3D11_QUERY_TIMESTAMP_DISJOINT queries, and tweaked the compute shader to minimize the time reported by these queries for my specific use case.
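
As an illustration of the arithmetic such a profiler has to do, here is a minimal C++ sketch that turns two timestamp readings into milliseconds. It is an assumption about how the results might be processed, not the actual profiler code. `DisjointData` is a hypothetical plain struct that mirrors the two fields of `D3D11_QUERY_DATA_TIMESTAMP_DISJOINT` (`Frequency` and `Disjoint`) so the example compiles without the Windows SDK, and `elapsedMilliseconds` is a made-up helper name.

```cpp
#include <cstdint>

// Hypothetical mirror of D3D11_QUERY_DATA_TIMESTAMP_DISJOINT (Frequency, Disjoint).
struct DisjointData {
    std::uint64_t frequency;  // timestamp ticks per second
    bool disjoint;            // true when the timestamps in this interval are unreliable
};

// Converts a pair of D3D11_QUERY_TIMESTAMP readings into milliseconds.
// Returns false, leaving outMs untouched, when the measurement should be discarded.
bool elapsedMilliseconds(std::uint64_t beginTicks, std::uint64_t endTicks,
                         const DisjointData& d, double& outMs)
{
    if (d.disjoint || d.frequency == 0 || endTicks < beginTicks)
        return false;
    outMs = static_cast<double>(endTicks - beginTicks) * 1000.0
          / static_cast<double>(d.frequency);
    return true;
}
```

In a real D3D11 application the two tick values and the disjoint data would come from `ID3D11DeviceContext::GetData` on the corresponding queries; samples flagged as disjoint are usually dropped rather than averaged in.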
