by e4e78a06
1573 days ago
There are other costs that make very wide OoO either harder or more expensive on x86. x86 has far more flag-setting instructions than Arm, which adds dependencies the reorder engine has to sort through. x86 variable-length decoding needs about log(n) circuit depth to find the instruction boundaries in an n-byte fetch window, which either forces a longer pipeline or limits clocks. AVX-512 units are huge because of the decision to give them the same latency as normal MUL/ADD/FMA. And x86 designs have to scale in clock speed from server and tablet parts (~2.5 GHz) to desktop and high-performance laptop parts (5 GHz+). That forces compromises like the 5-cycle L1 in Golden Cove. Meanwhile Apple has a 3-cycle 128 kB L1D (alongside a 192 kB L1I).
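To make the log(n) decode-depth point concrete, here is a rough toy model in Python. It is only a sketch under stated assumptions, not how any shipping decoder is built: it assumes a length has already been decoded speculatively at every byte offset, and the function names and the example window are made up for illustration. Each loop iteration in the "parallel" version stands in for one layer of logic in which every offset updates at once.

```python
def instruction_starts_sequential(lengths):
    """Walk the chain one instruction at a time: depth grows linearly."""
    starts = []
    i = 0
    while i < len(lengths):
        starts.append(i)
        i += lengths[i]
    return starts


def instruction_starts_parallel(lengths):
    """Find the same starts by pointer doubling: about log2(n) rounds.

    lengths[i] is the (hypothetical) instruction length decoded as if an
    instruction began at byte offset i; every length must be >= 1.
    """
    n = len(lengths)
    if n == 0:
        return [], 0
    # jump[i] = offset of the next instruction; index n is an
    # "end of window" sentinel that points to itself.
    jump = [min(i + lengths[i], n) for i in range(n)] + [n]
    is_start = [False] * (n + 1)
    is_start[0] = True  # assume the window begins on an instruction boundary
    rounds = 0
    span = 1
    while span < n:
        # One round: every known start marks the offset it jumps to...
        new_start = is_start[:]
        for i in range(n):
            if is_start[i]:
                new_start[jump[i]] = True
        is_start = new_start
        # ...and every jump pointer doubles its reach.
        jump = [jump[jump[i]] for i in range(n + 1)]
        span *= 2
        rounds += 1
    return [i for i in range(n) if is_start[i]], rounds


# Made-up 16-byte window; most offsets hold junk lengths that never get used.
window = [1, 3, 9, 9, 2, 9, 5, 9, 9, 9, 9, 1, 4, 9, 9, 9]
starts, rounds = instruction_starts_parallel(window)
print(starts, rounds)
```

For a 16-byte window the doubling version finishes in 4 rounds, where the sequential walk takes one step per instruction. A fixed-width ISA skips the problem entirely, since every boundary is known up front.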