by mhkool 1593 days ago
Since performance is similar for array sizes below the L1 size and below the L2 size, I would like to see an attempt to improve B. B = L2-size / 2 / sizeof(int) - 16 might produce better results.
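To make the suggested formula concrete, here is a rough sketch in C. The helper name block_size_from_l2 and the guard for very small caches are my own assumptions, not something from the original code; only the formula itself is taken from the suggestion above.

```c
#include <stddef.h>

/* Hypothetical helper: block size B (in ints) from the L2 cache size in bytes,
 * using B = L2-size / 2 / sizeof(int) - 16. */
static size_t block_size_from_l2(size_t l2_bytes)
{
    /* Number of ints that fit in half of L2. */
    size_t half_l2_ints = l2_bytes / 2 / sizeof(int);

    /* Assumption: return 0 instead of wrapping around when the cache
     * is too small for the 16-element margin. */
    if (half_l2_ints <= 16)
        return 0;

    return half_l2_ints - 16;
}
```

For example, with a 256 KiB L2 and a 4-byte int, this gives 262144 / 2 / 4 - 16 = 32752 elements per block. The L2 size would have to be looked up or queried for the actual target CPU.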

Note also that _mm_broadcast_ss() is faster on newer processors.