| HN Mirror



	by mynameismon 1537 days ago
	Not OP, but from a cursory glance at the code, it seems to be achieved by combining three things: splitting the matrices into chunks that fit in the L1/L2 caches (line 2 in the code), tricks like switching the loop indices to get better cache locality, and SIMD plus fused multiply-add to speed things up further.
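	To make the first two ideas concrete, here is a rough sketch of my own, not the code being discussed. The names `matmul_naive` and `matmul_tiled` and the `tile` parameter are made up for illustration. It only shows the tiling and the i-k-j loop order; SIMD and FMA are not shown, since in a compiled language they would come from intrinsics or the compiler's auto-vectorizer working on the contiguous inner loop.

```python
def matmul_naive(a, b):
    """Textbook i-j-k triple loop; the inner loop walks b column-wise."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=32):
    """Same product, computed tile by tile with an i-k-j loop order.

    `tile` is a hypothetical block size; in real code it would be tuned
    so that the blocks being worked on fit in the L1/L2 caches.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # Outer three loops pick one block of each matrix.
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                # Inner three loops stay inside that block.
                for i in range(i0, min(i0 + tile, n)):
                    c_row = c[i]
                    for k in range(k0, min(k0 + tile, m)):
                        a_ik = a[i][k]   # reused across the whole j loop
                        b_row = b[k]     # row of b, read left to right
                        for j in range(j0, min(j0 + tile, p)):
                            # Multiply-add on contiguous data: the part a
                            # compiler could turn into SIMD FMA instructions.
                            c_row[j] += a_ik * b_row[j]
    return c
```

	With the j loop innermost, both `b_row` and `c_row` are read in memory order, which is the "switching indexes" part. In Python this is only a demonstration of the access pattern, not a speedup.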


andi999 1536 days ago

Thanks! Which file are you referring to?

mynameismon 1536 days ago

The code right above the assembly.