| 1. Why would you WANT to hit 5+GHz when the downsides of exponential power take over? High clocks aren't a feature -- they are a cope. AMD/Intel maintain I-cache and maintain a uop cache kept in sync. Using a tiny part to pre-decode is different from a massive uop cache working as far in advance as possible in the hopes that your loops will keep you busy enough that your tiny 4-wide decoder doesn't become overwhelmed. 2. The float workload was always BS because you can't run nothing but floats. The integer workload had 22.1w total core power and 4.8w power for the decoder. 4.8/22.1 is 21.7%. Even the 1.8w float case is 8% of total core power. The only other argument would be that the study is wrong and 4.8w isn't actually just decoder power. 3. We're talking about worst cases here. Nothing stops ARM cores from creating a "work pool" of upcoming branches in priority order for them to decode if they run out of stuff on the main branch. This is the best of both worlds where you can be faster on the main branch AND still do the same branchy code trick too. 4. This is the tail wagging the dog (and something else if your numbers are correct). Complex x86 instructions have garbage performance, so they are avoided by the compiler. The problem is that you can't GUARANTEE those instructions will NEVER be used, so the mere specter of them forces complex algorithms all over the place where ARM can do more simple things. In any case, your numbers raise a VERY interesting question about x86 being RISC under the hood. Consider this. Say that we have 1024 bytes of ARM code (256 instructions). x86 is around 15% smaller (871.25 bytes) and with the longer 4.25 byte instruction average, x86 should have around 205 instructions. If ARM is generating 19.3% more uops than instructions, we have about 305 uops. x86 with just 4.7% more has 215 uops (the difference here is way outside any margins of error here). If both are doing the same work, x86 uops must be in the range of 30% more complex. Given the limits of what an ALU can accomplish, we can say with certainty that x86 uops are doing SOMETHING that isn't the RISC they claim to be doing. Perhaps one could claim that x86 is doing some more sophisticated instructions in hardware, but that's a claim that would need to be substantiated (I don't know what ISA instructions you have that give a 15% advantage being done in hardware, but aren't already in the ARM ISA and I don't see ARM refusing to add circuitry for current instructions to the ALU if it could reduce uops by 15% either). 8. https://en.wikipedia.org/wiki/Peephole_optimization The final optimization stage is basically heuristic find & replace. There could in theory be a mathematically provable "best instruction selection", but finding it would require trying every possible combination which isn't possible as long as P=NP holds true. My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though. Instead, you take your basic, short instruction and pad it with useless bytes. Add a couple useless bytes to a bunch of instructions and you now have the right length to push the function over to the cache boundary without adding any NOPs. But the issues go deeper. When do you use a REX prefix? You may want it so you can use 16 registers, but it also increases code size. REX2 with APX is going to increase this issue further where you must juggle when to use 8, 16, or 32 registers and when you should prefer the long REX2 because it has 3-register instructions. All kinds of weird tradeoffs exist throughout the system. Because the compilers optimize for the CPU and the CPU optimizes for the compiler, you can wind up in very weird places. In an ISA like ARM, there isn't any code density weirdness to consider. In fact, there's very little weirdness at all. Write it the intuitive way and you're pretty much guaranteed to get good performance. Total time to work on the compiler is a zero-sum game given the limited number of experts. If you have to deal with these kinds of heuristic headaches, there's something else you can't be working on. |
I'd call that more neat than absurd.
> You may want it so you can use 16 registers, but it also increases code size.
RISC-V has the exact same issue, some compressed instructions having only 3 bits for operand registers. And on x86 for 64-bit-operand instructions you need the REX prefix always anyways. And it's not that hard to pretty reasonably solve - just assign registers by their use count.
Peephole optimizations specifically here are basically irrelevant. Much of the complexity for x86 comes from just register allocation around destructive operations (though, that said, that does have rather wide-ranging implications). Other than that, there's really not much difference; all have the same general problems of moving instructions together for fusing, reordering to reduce register pressure vs putting parallelizable instructions nearer, rotating loops to reduce branches, branches vs branchless.