
Hacker News


	by brrrrrm 1327 days ago
	Yep! This is an assumed optimization when it comes to modern linear algebra compilers. The new primitives go way beyond FMAs: full matrix multiplies on NVIDIA and Intel hardware, and outer-product accumulates on Apple silicon. It's also expected that these are used nearly optimally (or you've got a bug).
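To make the difference in granularity concrete, here is a minimal pure-Python sketch of the same update C += A·B written three ways. This is not how any compiler or hardware actually issues these operations. The first version does one scalar FMA per (i, j, k). The second is a sum of rank-1 outer-product accumulates, the shape the comment attributes to Apple silicon. The third loops over small tiles, where each tile multiply-accumulate stands in for a single matrix-multiply instruction. The function names and the tile size of 2 are hypothetical.

```python
# Illustrative only: plain Python lists, no hardware intrinsics.
# All three functions compute C += A @ B in place and return C.

def matmul_fma(A, B, C):
    """Scalar view: one multiply-add per (i, j, k) triple."""
    n, k_dim, m = len(A), len(B), len(B[0])
    for i in range(n):
        for j in range(m):
            acc = C[i][j]
            for k in range(k_dim):
                acc += A[i][k] * B[k][j]  # stands in for a single scalar FMA
            C[i][j] = acc
    return C


def outer_product_accumulate(C, a_col, b_row):
    """One rank-1 update: C += outer(a_col, b_row)."""
    for i, a in enumerate(a_col):
        for j, b in enumerate(b_row):
            C[i][j] += a * b
    return C


def matmul_outer(A, B, C):
    """Matmul as K outer products: column k of A times row k of B."""
    for k in range(len(B)):
        a_col = [row[k] for row in A]
        outer_product_accumulate(C, a_col, B[k])
    return C


def matmul_tiled(A, B, C, tile=2):
    """Block view: each (ti, tj, tk) step is one tile-sized multiply-accumulate.
    Assumes every dimension is a multiple of `tile` (hypothetical tile size)."""
    n, k_dim, m = len(A), len(B), len(B[0])
    for ti in range(0, n, tile):
        for tj in range(0, m, tile):
            for tk in range(0, k_dim, tile):
                # This tile-sized MAC is the unit a matrix instruction would consume.
                for i in range(ti, ti + tile):
                    for j in range(tj, tj + tile):
                        for k in range(tk, tk + tile):
                            C[i][j] += A[i][k] * B[k][j]
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    for f in (matmul_fma, matmul_outer, matmul_tiled):
        print(f.__name__, f(A, B, [[0, 0], [0, 0]]))  # [[19, 22], [43, 50]]
```

All three produce the same result. They differ only in how much work each "primitive" step does, which is what a compiler targeting these units has to map onto.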


I am extremely familiar with how far these primitives go, ha. I develop kernels professionally for AWS ML accelerators.