by e4e78a06 1573 days ago
There are other costs that make very wide OoO either more difficult or more expensive. x86 has a lot more flag-based instructions than Arm, which adds dependencies that the reorder engine has to track. x86 variable-length decoding takes O(log n) circuit depth in the decoder for an n-byte decode window, which either forces a longer pipeline or limits clocks. And AVX-512 units are just huge because a decision was made to give them the same latency as normal MUL/ADD/FMA. On top of that, x86 designs have to scale in clock speed from server / tablet (~2.5GHz) to desktop and high-perf laptop (5GHz+). That forces suboptimal designs like the 5 cycle L1 in Golden Cove. Meanwhile Apple has a 3 cycle 128kB L1D (the 192kB figure is its L1I).
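
To make the log(n) decode depth point concrete, here is a rough sketch of boundary finding by pointer doubling. It is an illustration only, not how any shipping decoder is built: the function name `instruction_starts` and the input layout are made up for this example, and it assumes a length has already been speculatively decoded at every byte offset of the window.

```python
def instruction_starts(lengths):
    """Find instruction start offsets in a byte window by pointer doubling.

    lengths[i] is the (hypothetical) length an instruction would have if one
    started at byte i; every entry must be >= 1. Returns (starts, rounds).
    Each round models one layer of logic where all byte positions are
    updated in parallel, so `rounds` stands in for circuit depth.
    """
    n = len(lengths)
    # jump[i]: offset reached from byte i after 2**rounds instructions.
    # Index n is a sentinel meaning "past the end of the window".
    jump = [min(i + lengths[i], n) for i in range(n)] + [n]
    is_start = [False] * (n + 1)
    is_start[0] = True  # the window is assumed to begin on a boundary
    rounds = 0
    while jump[0] < n:
        # Mark everything reachable from a known start via the current jump.
        new_start = is_start[:]
        for i in range(n):
            if is_start[i]:
                new_start[jump[i]] = True
        # Double the jump distance, reading only the previous round's values.
        jump = [jump[jump[i]] for i in range(n + 1)]
        is_start = new_start
        rounds += 1
    return [i for i in range(n) if is_start[i]], rounds


# 12-byte window holding instructions of length 1, 3, 2, 5, 1.
# Entries at non-start bytes are arbitrary speculative lengths.
window = [1, 3, 1, 2, 2, 1, 5, 1, 1, 2, 3, 1]
print(instruction_starts(window))
```

The number of known starts doubles each round, so five instructions need three rounds here. With fixed-length instructions every start is known up front and no rounds are needed at all, which is the contrast with Arm.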