by mynameismon 1537 days ago
Not OP, but from a cursory glance at the code, it seems to be achieved by a combination of splitting the matrices into chunks that fit in the L1/L2 caches (line 2 in the code), using tricks like switching indexes to get better cache locality, and using SIMD + fused multiply-add (FMA) to speed things up further.
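To make the chunking and index-switching ideas concrete, here is a rough sketch in plain C. It is not the code from the post: `matmul_blocked` and `TILE` are names I made up, and the tile size of 32 is a guess rather than a tuned value. It only shows the general shape of the technique, assuming square row-major matrices.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tile edge. A real kernel would tune this to the L1/L2 sizes. */
#define TILE 32

/* Return the smaller of two sizes (used to clamp a tile at the matrix edge). */
static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* C = A * B for row-major n x n matrices, computed one tile at a time. */
void matmul_blocked(size_t n, const float *a, const float *b, float *c)
{
    assert(n > 0);
    memset(c, 0, n * n * sizeof *c);

    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t kk = 0; kk < n; kk += TILE) {
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t i_end = min_size(ii + TILE, n);
                size_t k_end = min_size(kk + TILE, n);
                size_t j_end = min_size(jj + TILE, n);

                /* i-k-j order instead of the textbook i-j-k: the inner
                   loop reads B and writes C along contiguous rows. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++) {
                            /* One multiply-add per element. */
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The inner loop is a plain multiply-add over contiguous memory, which is the pattern a compiler may turn into SIMD FMA instructions when built with something like `-O3 -march=native`. Hand-written kernels usually go further and use intrinsics or assembly for that part.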

Thanks! Which file are you referring to?
The code right above the assembly.