by physguy1123 2891 days ago
You should try maintaining 4 independent sum variables and adding them together after the loop, so there's no serializing dependency between consecutive adds. Such a transformation in microbenchmarks is a fun trick to show the power of a proper OOO engine with pipelined execution units. Assuming no memory bottlenecks, one should be able to use (issue width × instruction latency) independent sum streams without spending more time in the hot loop.
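To make that concrete, here's a minimal sketch in plain scalar C. The function names (sum_serial, sum_4way) are made up for illustration, and it's the bare idea rather than a tuned or vectorized kernel:

```c
#include <stddef.h>

/* Baseline: every add has to wait for the previous one,
   so the loop runs at one add per adder latency. */
float sum_serial(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: the adds within one iteration
   do not depend on each other, so the core can overlap them. */
float sum_4way(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per stream. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Combine the streams once, after the loop. */
    float s = (s0 + s1) + (s2 + s3);

    /* Tail: the 0-3 leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Note that this reorders the floating point additions, so the result can differ from the serial sum in the last bits. That's also why a compiler generally won't do this for you unless you allow it (e.g. -ffast-math on GCC/Clang). As an example of the sizing rule with made-up numbers: an adder with 4 cycles of latency that can start 2 adds per cycle would want 2 × 4 = 8 streams, not 4.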

For what it's worth, vmovdqa only has a 4-wide issue width when it is moving between registers; the memory-load form has a 2-wide issue width. Floating point adders themselves only have a 1- or 2-wide issue width depending on your hardware, so it doesn't really matter.