by dfbrown 3540 days ago
I would try unrolling 2-4 iterations of the loop. Several sequential loads aren't much slower than a single load, so batching your loads and stores together lets you do more arithmetic for each trip to memory.
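A rough sketch of what I mean, in C. The original loop isn't shown here, so the kernel (`src[i] * 3 + 1`) and the function names `scale_add_simple` and `scale_add_unrolled4` are made up purely for illustration; swap in whatever your loop body actually does. It assumes `dst` and `src` do not partially overlap.

```c
#include <stddef.h>

/* Baseline (hypothetical kernel): one load, one multiply-add,
 * and one store per iteration. */
void scale_add_simple(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 3 + 1;
}

/* Same kernel unrolled by 4: the four loads are grouped, then the
 * arithmetic, then the four stores.
 * Assumes dst and src do not partially overlap. */
void scale_add_unrolled4(int *dst, const int *src, size_t n)
{
    size_t i = 0;

    /* Main loop: handles 4 elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        /* Batch the loads. */
        int a = src[i];
        int b = src[i + 1];
        int c = src[i + 2];
        int d = src[i + 3];

        /* Arithmetic on values already in registers. */
        a = a * 3 + 1;
        b = b * 3 + 1;
        c = c * 3 + 1;
        d = d * 3 + 1;

        /* Batch the stores. */
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }

    /* Tail loop: the 0-3 elements left when n is not a multiple of 4. */
    for (; i < n; i++)
        dst[i] = src[i] * 3 + 1;
}
```

Both functions should produce identical output for any `n`, including lengths that aren't a multiple of 4. An optimizing compiler may already unroll the simple version on its own, so measure both before keeping the hand-unrolled one.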