	by mukel 1048 days ago
	Author here: I implemented several versions of matmul with different unrolling schemes using the Vector API and got a ~4x speedup with a single thread, but the speedup fades as you add more threads. I think performance is constrained by memory bandwidth, which is already saturated by a small number of threads, regardless of vectorization.
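	To make "unrolling scheme" concrete, here is a rough sketch in plain Java rather than the Vector API, so it compiles without the incubator module. It is not the code from the project: the class name `MatVecSketch`, the method names, the row-major layout, and the unroll factor of 4 are all assumptions for illustration. It shows a matrix-vector product with one accumulator next to a version with four independent accumulators.

```java
// Hypothetical sketch: names, layout, and unroll factor are assumptions,
// not the author's implementation.
final class MatVecSketch {

    // Baseline: out[i] = sum over j of w[i * n + j] * x[j], single accumulator.
    static void matvecScalar(float[] out, float[] w, float[] x, int rows, int n) {
        for (int i = 0; i < rows; i++) {
            int base = i * n; // start of row i in the row-major weight array
            float sum = 0f;
            for (int j = 0; j < n; j++) {
                sum += w[base + j] * x[j];
            }
            out[i] = sum;
        }
    }

    // Same product, inner loop unrolled 4x with independent accumulators,
    // so consecutive multiply-adds do not depend on each other.
    static void matvecUnrolled4(float[] out, float[] w, float[] x, int rows, int n) {
        int upper = n & ~3; // largest multiple of 4 that is <= n
        for (int i = 0; i < rows; i++) {
            int base = i * n;
            float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
            int j = 0;
            for (; j < upper; j += 4) {
                s0 += w[base + j] * x[j];
                s1 += w[base + j + 1] * x[j + 1];
                s2 += w[base + j + 2] * x[j + 2];
                s3 += w[base + j + 3] * x[j + 3];
            }
            float sum = (s0 + s1) + (s2 + s3);
            for (; j < n; j++) { // tail when n is not a multiple of 4
                sum += w[base + j] * x[j];
            }
            out[i] = sum;
        }
    }

    public static void main(String[] args) {
        int rows = 2, n = 5;
        float[] w = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        float[] x = {1, 2, 3, 4, 5};
        float[] a = new float[rows];
        float[] b = new float[rows];
        matvecScalar(a, w, x, rows, n);
        matvecUnrolled4(b, w, x, rows, n);
        System.out.println(java.util.Arrays.toString(a));
        System.out.println(java.util.Arrays.toString(b));
    }
}
```

	Note that both versions read every weight exactly once per call, so unrolling (or SIMD) only reduces the compute per byte loaded, not the bytes loaded. That fits the observation above: once a few threads are streaming weights at full memory bandwidth, a faster inner loop has little left to win. With large inputs the two methods can differ in the last bits because the summation order changes.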