by fossa1
347 days ago
This is a textbook case of micro-architectural reality beating theoretical elegance. It's fascinating that replacing 5 loads with 2 loads + 3 vextq_f32 intrinsics, which should reduce memory pressure, ends up being slower because of execution port contention and dependency chains.
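For concreteness, here is a minimal sketch of the two variants. It assumes the kernel is a 5-tap sliding window over floats, which is a guess at the shape of the code being discussed rather than the actual code. The names `sum5_loads` and `sum5_ext` are hypothetical, and the scalar stand-ins in the `#else` branch exist only so the sketch also builds on non-ARM machines. Both functions need 8 readable floats at `p`.

```c
#include <assert.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#else
/* Scalar stand-ins (assumption, illustration only) so the sketch builds off ARM. */
typedef struct { float v[4]; } float32x4_t;

static inline float32x4_t vld1q_f32(const float *p) {
    float32x4_t r;
    for (int i = 0; i < 4; i++) r.v[i] = p[i];
    return r;
}
static inline void vst1q_f32(float *p, float32x4_t a) {
    for (int i = 0; i < 4; i++) p[i] = a.v[i];
}
static inline float32x4_t vaddq_f32(float32x4_t a, float32x4_t b) {
    float32x4_t r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i];
    return r;
}
/* Lanes n..3 of a followed by lanes 0..n-1 of b. */
static inline float32x4_t vextq_f32(float32x4_t a, float32x4_t b, int n) {
    assert(n >= 0 && n < 4);
    float32x4_t r;
    for (int i = 0; i < 4; i++)
        r.v[i] = (i + n < 4) ? a.v[i + n] : b.v[i + n - 4];
    return r;
}
#endif

/* Variant A: five overlapping loads, one per window offset.
 * The loads are independent of each other. */
void sum5_loads(const float *p, float *out) {
    float32x4_t acc = vld1q_f32(p);
    acc = vaddq_f32(acc, vld1q_f32(p + 1));
    acc = vaddq_f32(acc, vld1q_f32(p + 2));
    acc = vaddq_f32(acc, vld1q_f32(p + 3));
    acc = vaddq_f32(acc, vld1q_f32(p + 4));
    vst1q_f32(out, acc);
}

/* Variant B: two loads, then three vextq_f32 to rebuild offsets 1..3.
 * Every vextq_f32 has to wait for both loads, and it issues on the
 * vector pipes alongside the adds instead of on the load pipes. */
void sum5_ext(const float *p, float *out) {
    float32x4_t lo = vld1q_f32(p);      /* p[0..3] */
    float32x4_t hi = vld1q_f32(p + 4);  /* p[4..7] */
    float32x4_t acc = lo;
    acc = vaddq_f32(acc, vextq_f32(lo, hi, 1));  /* p[1..4] */
    acc = vaddq_f32(acc, vextq_f32(lo, hi, 2));  /* p[2..5] */
    acc = vaddq_f32(acc, vextq_f32(lo, hi, 3));  /* p[3..6] */
    acc = vaddq_f32(acc, hi);
    vst1q_f32(out, acc);
}
```

Both compute out[i] = p[i] + ... + p[i+4] for i = 0..3, so the only difference is where the work lands. Which one wins depends on the core, so it has to be measured on the target.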