| HN Mirror



	by andi999 1533 days ago
	How is this possible: "This is 40-50x faster than a naive C implementation of nested loops on my machine"? Do you know how it was achieved?

1 comment

mynameismon 1533 days ago

Not OP, but from a cursory glance at the code, the speedup seems to come from a combination of splitting the matrices into chunks that fit in the L1/L2 caches (line 2 in the code), tricks like switching the loop indices to get better cache locality, and using SIMD plus fused multiply-add (FMA) to speed things up further.
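
To make the first two points concrete, here is a rough C sketch of a naive triple loop next to a tiled version with the loops reordered. It is not the code under discussion: the names `matmul_naive`, `matmul_tiled` and the `TILE` value are made up for illustration, and the matrices are assumed to be square, row-major `float` arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tile edge; a real kernel would tune this so the tiles
   being worked on fit in the L1/L2 caches. */
#define TILE 32

/* Naive baseline: c = a * b for n x n row-major matrices, i-j-k order.
   The inner loop reads b with stride n, which is hard on the cache. */
void matmul_naive(size_t n, const float *a, const float *b, float *c) {
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Tiled version with i-k-j order inside each tile. The innermost loop
   walks b and c with unit stride, so each cache line is fully used. */
void matmul_tiled(size_t n, const float *a, const float *b, float *c) {
    assert(a && b && c);
    memset(c, 0, n * n * sizeof *c);
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_sz(ii + TILE, n);
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = min_sz(kk + TILE, n);
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = min_sz(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float aik = a[i * n + k]; /* reused across the whole j loop */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

The sketch has no explicit SIMD or FMA intrinsics. The unit-stride inner loop is the kind of loop a compiler can often vectorize on its own (for example with `-O3 -march=native` on GCC or Clang), whereas a hand-tuned kernel would typically go further than that.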


andi999 1533 days ago

Thanks! Which file are you referring to?


mynameismon 1533 days ago

The code right above the assembly.
