Hacker News
by
marshallward
806 days ago
I'm sure there's more to it, but just comparing the profile output shows aggressive use of prefetch and broadcast instructions.
jart
806 days ago
BLIS does that in their kernels. I've tried doing that, but I was never able to get better than about half the performance of MKL. The BLIS technique of tiling across k also requires atomics or an array of locks to write output, since several threads end up accumulating into the same elements of the output matrix.
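To make the locking point concrete, here is a rough sketch of splitting a matrix multiply across k with one lock per output row. This is not BLIS code: `matmul_split_k`, the row-major layout, the per-row lock granularity, and the plain scalar inner loop are all assumptions picked for illustration, and a real kernel would lock per tile and use vectorized microkernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not from BLIS): C (M x N) += A (M x K) * B (K x N),
// all row-major, with the K dimension split across num_threads threads.
void matmul_split_k(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, std::size_t M, std::size_t N,
                    std::size_t K, std::size_t num_threads) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    assert(num_threads > 0);

    std::vector<std::mutex> row_locks(M);  // the "array of locks": one per output row
    std::vector<std::thread> workers;
    const std::size_t chunk = (K + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        // Each thread owns the slice [k0, k1) of the K dimension.
        const std::size_t k0 = std::min(K, t * chunk);
        const std::size_t k1 = std::min(K, k0 + chunk);
        workers.emplace_back([&, k0, k1] {
            std::vector<float> partial(N);  // thread-private partial row of C
            for (std::size_t i = 0; i < M; ++i) {
                std::fill(partial.begin(), partial.end(), 0.0f);
                for (std::size_t k = k0; k < k1; ++k) {
                    const float a = A[i * K + k];
                    for (std::size_t j = 0; j < N; ++j)
                        partial[j] += a * B[k * N + j];
                }
                // Every thread contributes to every row of C, so the
                // accumulation into the shared output has to be serialized.
                std::lock_guard<std::mutex> guard(row_locks[i]);
                for (std::size_t j = 0; j < N; ++j)
                    C[i * N + j] += partial[j];
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

If the work were split across rows or columns of the output instead, each thread would own a disjoint piece of C and no locks would be needed. Splitting across k is what forces the synchronized write.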