|
|
|
|
|
by ack_complete
459 days ago
|
|
Yeah, the difference you're seeing is likely due to the inner loop doing so few iterations, in addition to not being unrolled. A hand-rolled version doing 63 iterations of an x4 unrolled loop should be able to saturate the execution core (it'll be bottlenecked by load throughput). But it'll be tough to get the compiler to generate that without intrinsics. |
|