by wtallis 677 days ago
The CPU core's instruction set has no influence on how well the chip as a whole manages power when not executing instructions.
2 comments

That is fair, I was taught that decoders for x86 are less efficient and more power hungry than RISC ISAs because of their variable length instructions.

I remember being told (and it might be wrong) that ARM can decode multiple instructions in parallel because the CPU knows where the next instruction starts, but for x86, you'd have to decode the instructions in order.
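To make that dependency concrete, here is a rough sketch in Python. The variable-length encoding is a made-up toy (the low two bits of an instruction's first byte give its length minus one), not real x86, and both function names are invented for illustration.

```python
def fixed_width_starts(code: bytes, width: int = 4) -> list[int]:
    """Fixed-width ISA: every start offset is known up front."""
    return list(range(0, len(code), width))


def variable_width_starts(code: bytes) -> list[int]:
    """Toy variable-length ISA (an assumption, not x86): the low 2 bits of an
    instruction's first byte encode its length minus one (1..4 bytes)."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = (code[offset] & 0b11) + 1
        # The next start is only known after this instruction's length is.
        offset += length
    return starts


toy = bytes([0x01, 0xFF, 0x00, 0x03, 0xAA, 0xBB, 0xCC, 0x02, 0x11, 0x22])
print(fixed_width_starts(bytes(12)))   # [0, 4, 8]
print(variable_width_starts(toy))      # [0, 2, 3, 7]
```

In the fixed-width case each start can be computed independently; in the toy variable-length case each start depends on the previous one, which is the serial chain being described.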

That seems to not matter much nowadays. There's another great (according to my untrained eye) writeup on Chips and Cheese about how little the ISA matters.

https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...

The various mentioned power consumption amounts are 4-10% per-core, or 0.5-6% of package (with the caveat of running with micro-op cache off) for Zen 2, and 3-10% for Haswell. That's not massive, but is still far from what I'd consider insignificant; it could give leeway for an extra core or some improved ALUs; or, even, depending on the benchmark, is the difference between Zen 4 and Zen 5 (making the false assumption of a linear relation between power and performance, at least), which'd essentially be a "free" generational improvement. Of course the reality is gonna be more modest than that, but it's not nothing.
You missed the part where they mention ARM ends up implementing the same thing to go fast.

The point is processors are either slow and efficient, or fast and inefficient. It's just a tradeoff along the curve.

ARM doesn't need the variable-length instruction decoding though, which on x86 essentially means that the decoder has to attempt to decode at every single byte offset for the start of the pipeline, wasting computation.

Indeed pretty much any architecture can benefit from some form of op cache, but less of a need for it means its size can be reduced (and savings spent in more useful ways), and you'll still need actual decoding at some point anyway (and, depending on the code footprint, may need it a lot).

More generally, throwing silicon at a problem is, quite obviously, a more expensive solution than not having the problem in the first place.

x86 processors simply run an instruction-length predictor, the same way they do branch prediction. That turns the problem into something that can be tuned. Instead of having to decode the instruction at every byte offset, you can simply decide to optimize for the 99% case with a slow path for rare combinations.
But bigger fixed-length instructions mean more I$ pressure, right?
That stuff is WAY out-of-date and was flatly wrong when it was published.

A715 cut decoder size a whopping 75% by dropping the more CISC 32-bit stuff and completely eliminated the uop cache too. Losing all that decode, cache, and cache controllers means a big reduction in power consumption (decoders are basically always on). All of ARM's latest CPU designs have eliminated uop cache for this same reason.

At the time of publication, we already knew that M1 (already out for nearly a year) was the highest IPC chip ever made and did not use a uop cache.

Clam makes some serious technical mistakes in that article and some info is outdated.

1. His claim that "ARM decoder is complex too" was wrong at the time (M1 being an obvious example) and has been proven more wrong since publication. ARM dropped the uop cache as soon as they dropped support for their very CISC-y 32-bit catastrophe. They bragged that this coincided with a whopping 75% reduction in decoder size for their A715 (while INCREASING from 4 decoders to 5) and this was almost single-handedly responsible for the reduced power consumption of that chip (as all the other changes were comparatively minor). NONE of the current-gen cores from ARM, Apple, or Qualcomm use uop cache eliminating these power-hungry cache and cache controllers.

2. The paper[0] he quotes has a stupid conclusion. They show integer workloads using a massive 22% of total core power on the decoder and even their fake float workload showed 8% of total core power. Realize that a study[1] of the entire Ubuntu package repo showed that just 12 int/ALU instructions made up 89% of all code with float/SIMD being in the very low single digits of use.

3. x86 decoder situation has gotten worse. Because adding extra decoders is exponentially complex, they decided to spend massive amounts of transistors on multiple decoder blocks working on various speculated branches. Setting aside that this penalizes unrolled code (where they may have just 3-4 decoders while modern ARM will have 10+ decoders), the setup for this is incredibly complex and man-year intensive.

4. "ARM decodes into uops too" is a false equivalency. The uops used by ARM are extremely close to the original instructions as shown by them being able to easily eliminate the uop cache. x86 has a much harder job here mapping a small set of instructions onto a large set.

5. "ARM is bloated too". ARM redid their entire ISA to eliminate bloat. If ISA didn't actually matter, why would they do this?

6. "RISC-V will become bloated too" is an appeal to ignorance. x86 has SEVENTEEN major SIMD extensions excluding the dozen or so AVX-512 extensions all with various incompatibilities and issues. This is because nobody knew what SIMD should look like. We know now and RISC-V won't be making that mistake. x86 has useless stuff like BCD instructions using up precious small instruction space because they didn't know. RISC-V won't do this either. With 50+ years of figuring the basics out, RISC-V won't be making any major mistakes on the most important stuff.

7. Omitting complexity. A bloated, ancient codebase takes forever to do anything with. A bloated, ancient ISA takes forever to do anything with. If ARM and Intel both put X dollars into a new CPU design, Intel is going to spend 20-30% or maybe even more of their budget on devs spending time chasing edge cases and testers to test all those edge cases. Meanwhile, ARM is going to spend that 20-30% of their budget on increasing performance. All other things equal, the ARM chip will be better at any given design price point.

8. Compilers matter. Spitting out fast x86 code is incredibly hard because there are so many variations on how to do things each with their own tradeoffs (that conflate in weird ways with the tradeoffs of nearby instructions). We do peephole heuristic optimizations because provably fast would take centuries. RISC-V and ARM both make it far easier for compiler writers because there's usually just one option rather than many options and that one option is going to be fast.

[0] https://www.usenix.org/system/files/conference/cooldc16/cool...

[1] https://oscarlab.github.io/papers/instrpop-systor19.pdf

One more: there's more to an ISA than just the instructions; there's semantic differences as well. x86 dates to a time before out-of-order execution, caches, and multi-core systems, so it has an extremely strict memory model that does not reflect modern hardware -- the only memory-reordering optimization permitted by the ISA is store buffering.

Modern x86 processors will actually perform speculative weak memory accesses in order to try to work around this memory model, flushing the pipeline if it turns out a memory-ordering guarantee was violated in a way that became visible to another core -- but this has complexity and performance impacts, especially when applications make heavy use of atomic operations and/or communication between threads.
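The store-buffering relaxation is usually illustrated with the classic two-thread litmus test (thread 0: x = 1; r0 = y, thread 1: y = 1; r1 = x). Below is a minimal sketch that enumerates every interleaving on a toy machine, with and without per-thread FIFO store buffers. The model and the name sb_outcomes are assumptions for illustration; this is not how any real core is implemented.

```python
def sb_outcomes(use_store_buffer: bool) -> set:
    """Enumerate all (r0, r1) results of the store-buffering litmus test."""
    programs = (
        (("st", "x"), ("ld", "y")),  # thread 0: x = 1; r0 = y
        (("st", "y"), ("ld", "x")),  # thread 1: y = 1; r1 = x
    )
    results = set()

    def explore(pcs, mem, bufs, regs):
        if pcs == (2, 2):
            results.add(regs)
            return
        for t in (0, 1):
            # Option A: drain thread t's oldest buffered store to memory.
            if bufs[t]:
                drained = bufs[t][0]
                new_bufs = tuple(b[1:] if i == t else b for i, b in enumerate(bufs))
                explore(pcs, {**mem, drained: 1}, new_bufs, regs)
            # Option B: execute thread t's next instruction.
            if pcs[t] < 2:
                op, var = programs[t][pcs[t]]
                new_pcs = tuple(pc + 1 if i == t else pc for i, pc in enumerate(pcs))
                if op == "st" and use_store_buffer:
                    new_bufs = tuple(b + (var,) if i == t else b for i, b in enumerate(bufs))
                    explore(new_pcs, mem, new_bufs, regs)
                elif op == "st":
                    explore(new_pcs, {**mem, var: 1}, bufs, regs)
                else:
                    # A load sees the thread's own buffered store first, then memory.
                    value = 1 if var in bufs[t] else mem[var]
                    new_regs = tuple(value if i == t else r for i, r in enumerate(regs))
                    explore(new_pcs, mem, bufs, new_regs)

    explore((0, 0), {"x": 0, "y": 0}, ((), ()), (None, None))
    return results


print(sorted(sb_outcomes(use_store_buffer=False)))  # [(0, 1), (1, 0), (1, 1)]
print(sorted(sb_outcomes(use_store_buffer=True)))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

In this toy model the only outcome the store buffer adds is (0, 0), where both loads run before either store becomes visible to the other thread.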

Simple atomic operations can be an order of magnitude faster on ARMv8 vs x86: https://web.archive.org/web/20220129144454/https://twitter.c...

"the only memory-reordering optimization permitted by the ISA is store buffering."

I think this is a mischaracterization of TSO. TSO only dictates the store ordering to other entities in the system, the individual cores are fully capable of using the results of stores that are not yet visible for their own OoO purposes as long as the dataflow dependencies are correctly solved. The complexities of the read/write bypassing is simply to clarify correct program order.

And this is why the TSO/non TSO mode on something like the apple cores doesn't seem to make a huge difference, particularly if one assumes that the core is aggressively optimized for the arm memory model, and the TSO buffering/ordering is not a critical optimization point.

Put another way, a core designed to track store ordering utilizing some kind of writeback merging is going to be fully capable of executing just as aggressively OoO and holding back or buffering the visibility of completed stores until earlier stores complete. In fact for multithreaded lock-free code the lack of explicit write fencing is likely a performance gain for very carefully optimized code in most cases. A core which can pipeline and execute multiple outstanding store fences is going to look very similar to one that implements TSO.

Yes, and Apple added this memory model to their ARM implementation so Rosetta2 would work well.
Some notes:

3: I don't think more decoders should be exponentially more complex, or even polynomial; I think O(n log n) should suffice. It just has a hilarious constant factor due to the lookup tables and logic needed, and that log factor also impacts the critical path length, i.e. pipeline length, i.e. mispredict penalty. Of note is that x86's variable-length instructions aren't even particularly good at code size.

Golden Cove (~1y after M1) has 6-wide decode, which is probably reasonably near M1's 8-wide given x86's complex instructions (mainly free single-use loads). [EDIT: actually, no, chipsandcheese's diagram shows it only moving 6 micro-ops per cycle to reorder buffer, even out of the micro-op cache. Despite having 8/cycle retire. Weird.]

6: The count of extensions is a very bad way to measure things; RISC-V will beat everything in that in no time, if not already. The main things that matter are ≤SSE4.2 (uses same instruction encoding as scalar code); AVX1/2 (VEX prefix); and AVX-512 (EVEX). The actual instruction opcodes are shared across those. But three encoding modes (plus the three different lengths of the legacy encoding) is still bad (and APX adds another two onto this) and the SSE-to-AVX transition thing is sad.

RISC-V already has two completely separate solutions for SIMD - v (aka RVV, i.e. the interesting scalable one) and p (a simpler thing that works in GPRs; largely not being worked on but there's still some activity). And if one wants to count extensions, there are already a dozen for RVV (never mind its embedded subsets) - Zvfh, Zvfhmin, Zvfbfwma, Zvfbfmin, Zvbb, Zvkb, Zvbc, Zvkg, Zvkned, Zvknhb, Zvknha, Zvksed, Zvksh; though, granted, those work better together than, say, SSE and AVX (but on x86 there's no reason to mix them anyway).

And RVV might get multiple instruction encoding forms too - the current 32-bit one is forced into allowing using only one register for masking due to lack of encoding space, and a potential 48-bit and/or 64-bit instruction encoding extension has been discussed quite a bit.

8: RISC-V RVV can be pretty problematic for some things if compiling without a specific target architecture, as the scalability means that different implementations can have good reason to have wildly different relative instruction performance (perhaps most significant being in-register gather (aka shuffle) vs arithmetic vs indexed load from memory).

3. You can look up the papers released in the late 90s on the topic. If it was O(n log n), going bigger than 4 full decoders would be pretty easy.

6. Not all of those SIMD sets are compatible with each other. Some (eg, SSE4a) wound up casualties of the Intel v AMD war. It's so bad that the Intel AVX10 proposal is mostly about trying to unify their latest stuff into something more cohesive. If you try to code this stuff by hand, it's an absolute mess.

The P proposal is basically DOA. It could happen, but nobody's interested at this point. Just like the B proposal subsumed a bunch of ridiculously small extensions, I expect a new V proposal to simply unify these. As you point out, there isn't really any conflict between these tiny instruction releases.

There is discussion around the 48-bit format (the bits have been reserved for years now), but there are a couple different proposals (personally, I think 64-bit only with the ability to put multiple instructions inside is better, but that's another topic). Most likely, a 48-bit format does NOT do multiple encoding, but instead does a superset of encodings (just like how every 16-bit instruction expands into a 32-bit instruction). They need/want 48-bits to allow 4-address instructions too, so I'd imagine it's coming sooner or later.

Either way, the length encoding is easy to work with compared to x86 where you must check half the bits in half the bytes before you can be sure about how long your instruction really is.

8. There could be some variance, but x86 has this issue too and SO many more besides.

The trend seems to be going towards multiple decoder complexes. Recent designs from AMD and Intel do this.

It makes sense to me: if the distance between branches is small, a 10-wide decode may be wasted anyway. Better to decode multiple basic blocks in parallel.

I know the E-cores (gracemont, crestmont, skymont) have the multi-decoder setup; the first couple search results don't show Golden Cove being the same. Do you have some reference for that?

6. Ah yeah the funky SSE4a thing. RISC-V has its own similar but worse thing with RVV0.7.1 / xtheadvector already though, and it can be basically guaranteed that there will be tons of one-off vendor extensions, including vector ones, given that anyone can make such.

8. RVV's vrgather is extremely bad at this, but is very important for a bunch of non-trivial things; existing RVV1.0 hardware has it at O(LMUL^2), e.g. BPI-F3 takes 256 cycles for LMUL=8[1]. But some hypothetical future hardware could do it at O(LMUL) for non-worst-case indices, thus massively changing tradeoffs. So far the compiler approaches are to just not do high LMUL when vrgather is needed (potentially leaving free perf on the table), or using indexed loads (potentially significantly worse).

Whereas x86 and ARM SIMD perf variance is very tiny; basically everything is pretty proportional everywhere, with maybe the exception of very old atom cores. There'll be some differences of 2x up or down of throughput of instruction classes, but it's generally not so bad as to make way for alternative approaches to be better.

[1]: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.h...

Expanding on 3: I think it ends up at O(n^2 * log n) transistors, O(log n) critical path (not sure on routing or what fan-out issues might there be).

Basically: determine end of instruction at each byte (trivial but expensive). Determine end of two instructions at each byte via end2[i]=end[end[i]]. Then end4[i]=end2[end2[i]], etc, log times.

That's essentially log(n) shuffles. With 32-byte/cycle decode that's roughly five 'vpermb ymm's, which is rather expensive (though various forms of shortcuts should exist - for the larger layers direct chasing is probably feasible, and for the smaller ones some special-casing of single-byte instructions could work).

And, actually, given the mention of O(log n)-transistor shuffles at http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo..., it might even just be O(n * log^2(n)) transistors.

Importantly, x86 itself plays no part in the non-trivial part. It applies equally to the RISC-V compressed extension, just with a smaller constant.
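The end2[i] = end[end[i]] doubling scheme can be sketched in software. This is a toy under stated assumptions: the length encoding (low two bits of the first byte = length minus one) is invented, and all three function names are hypothetical. Real hardware would evaluate each table level as parallel logic rather than as loops.

```python
def next_start_table(code: bytes) -> list[int]:
    """For every byte offset i, speculatively compute where an instruction
    starting at i would end. Toy encoding (an assumption): low 2 bits of the
    first byte = length - 1. Index len(code) is a sentinel mapping to itself."""
    n = len(code)
    return [min(i + (code[i] & 0b11) + 1, n) for i in range(n)] + [n]


def doubling_tables(end1: list[int], levels: int) -> list[list[int]]:
    """tables[k][i] = offset reached after skipping 2**k instructions from i."""
    tables = [end1]
    for _ in range(levels - 1):
        prev = tables[-1]
        tables.append([prev[prev[i]] for i in range(len(prev))])
    return tables


def instruction_starts(code: bytes, count: int) -> list[int]:
    """Start of each of the first `count` instructions, each one found
    independently by composing tables according to the bits of its index."""
    levels = max((count - 1).bit_length(), 1)
    tables = doubling_tables(next_start_table(code), levels)
    starts = []
    for k in range(count):
        pos = 0
        for bit in range(levels):
            if (k >> bit) & 1:
                pos = tables[bit][pos]
        starts.append(pos)
    return starts


toy = bytes([0x01, 0xFF, 0x00, 0x03, 0xAA, 0xBB, 0xCC, 0x02, 0x11, 0x22])
print(instruction_starts(toy, 4))  # [0, 2, 3, 7]
```

Each start needs at most `levels` table lookups, which is where the O(log n) depth comes from; the cost is computing a speculative length at every byte offset, including offsets that turn out not to be instruction starts.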

Determining the end of a RISC-V instruction requires checking two bits and you have the knowledge that no instruction exceeds 4 bytes or uses less than 2 bytes.
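As a rough illustration of the two-bit check, here is a sketch that assumes only the standard 16-bit (compressed) and 32-bit encodings are in play, as the comment does; the longer reserved formats are ignored. The function names are made up for the example.

```python
def rv_instruction_length(first_halfword: int) -> int:
    """Length in bytes, from the low two bits of the first 16-bit parcel.
    Assumes only 16-bit compressed and 32-bit instructions exist."""
    return 4 if (first_halfword & 0b11) == 0b11 else 2


def rv_starts(code: bytes) -> list[int]:
    """Walk a little-endian instruction stream and collect start offsets."""
    starts = []
    offset = 0
    while offset + 2 <= len(code):
        starts.append(offset)
        parcel = int.from_bytes(code[offset:offset + 2], "little")
        offset += rv_instruction_length(parcel)
    return starts


# nop (0x00000013), c.nop (0x0001), nop (0x00000013)
stream = bytes.fromhex("13000000010013000000")
print(rv_starts(stream))  # [0, 4, 6]
```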

x86 requires checking for a REX, REX2, VEX, EVEX, etc prefix. Then you must check for either 1 or 2 instruction bytes. Then you must check for the existence of a register byte, how many immediate byte(s), and if you use a scaled index byte. Then if a register byte exists, you must check it for any displacement bytes to get your final instruction length total.

RISC-V starts with a small complexity then multiplies it by a small amount. x86 starts with a high complexity then multiplies it by a big amount. The real world difference here is large.

As I pointed out elsewhere ARM's A715 dropped support for aarch32 (which is still far easier to decode than x86) and cut decoder size by 75% while increasing raw decoder count by 20%. The decoder penalties of bad ISA design extend beyond finding instruction boundaries.

> With 50+ years of figuring the basics out, RISC-V won't be making any major mistakes on the most important stuff.

RVV does have significant departures from prior work, and some of them are difficult to understand:

- the whole concept of avl, which adds complexity in many areas including reg renaming. From where I sit, we could just use masks instead.

- mask bits reside in the lower bits of a vector, so we either require tons of lane-crossing wires or some kind of caching.

- global state LMUL/SEW makes things hard for compilers and OoO.

- LMUL is cool but I imagine it's not fun to implement reductions, and vrgather.

How does avl affect register renaming? (there's the edge-case of vl=0 that is horrifically stupid (which is by itself a mistake for which I have seen no justification but whatever) but that's probably not what you're thinking of?) Agnostic mode makes it pretty simple for hardware to do whatever it wants.

Over masks it has the benefit of allowing simple hardware short-circuiting, though I'd imagine it'd be cheap enough to 'or' together mask bit groups to short-circuit on (and would also have the benefit of better masked throughput)

Cray-1 (1976) had VL, though, granted, that's a pretty long span of no-VL until RVV.

Was thinking of a shorter avl producing partial results merged into another reg. Something like a += b; a[0] += c[0]. Without avl we'd just have a write-after-write, but with it, we now have an additional input, and whether this happens depends on global state (VL).

Espasa discusses this around 6:45 of https://www.youtube.com/watch?v=WzID6kk8RNs.

Agree agnostic would help, but the machine also has to handle SW asking for mask/tail unchanged, right?
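A toy model of the a += b; a[0] += c[0] case may help show where the extra input comes from. This is only a sketch of the semantics with plain Python lists: vadd and its parameters are invented names, and -1 stands in for "the hardware may write all ones" under the agnostic policy.

```python
def vadd(dest_old, a, b, vl, tail_agnostic):
    """Add the first vl elements of a and b. Elements at index >= vl are the tail."""
    out = []
    for i in range(len(dest_old)):
        if i < vl:
            out.append(a[i] + b[i])
        elif tail_agnostic:
            out.append(-1)           # placeholder: tail contents are not preserved
        else:
            out.append(dest_old[i])  # tail undisturbed: the old destination is an input
    return out


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = [100, 200, 300, 400]

a = vadd(a, a, b, vl=4, tail_agnostic=False)             # a += b
undisturbed = vadd(a, a, c, vl=1, tail_agnostic=False)   # a[0] += c[0]
agnostic = vadd(a, a, c, vl=1, tail_agnostic=True)

print(undisturbed)  # [111, 22, 33, 44]
print(agnostic)     # [111, -1, -1, -1]
```

With tail undisturbed, the second operation has to read the previous value of the destination to fill the tail, so it is no longer a plain write-after-write. With tail agnostic, the old destination is not needed for the tail.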

Some notes: 1. Consider whether M1's 8-wide decoder could hit the 5+ GHz clock speeds that Intel Golden Cove's decoder can. More complex logic with more delays is harder to clock up. Of course M1 may be held back by another critical path, but it's interesting that no one has managed to get an 8-wide Arm decoder running at the clock speeds that Zen 3/4 and Golden Cove can.

A715's slides say the L1 icache gains uop cache features including caching fusion cases. Likely it's a predecode scheme much like AMD K10, just more aggressive with what's in the predecode stage. Arm has been doing predecode (moving some stages to the L1i fill path rather than the hotter L1i hit path) to mitigate decode costs for a long time. Mitigating decode costs again with a uop cache never made much sense especially considering their low clock speeds. Picking one solution or the other is a good move, as Intel/AMD have done. Arm picked predecode for A715.

2. The paper does not say 22% of core power is in the decoders. It does say core power is ~22% of package power. Wrong figure? Also, can you determine if the decoder power situation is different on Arm cores? I haven't seen any studies on that.

3. Multiple decoder blocks don't penalize unrolled code once the load balancing is done right, which Gracemont did. And you have to massively unroll a loop to screw up Tremont anyway. Conversely, decode blocks may lose less throughput with branchy code. Consider that decode slots after a taken branch are wasted, and clustered decode gets around that. Intel stated they preferred 3x3 over 2x4 for that reason.
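A crude cycle-count model shows both sides of this trade-off. Everything here is an assumption for illustration: every basic block ends in a taken branch, a decoder cannot continue past that branch in the same cycle, blocks are handed to clusters round-robin, and there is no load balancing within a long block. The widths and block lengths are made up.

```python
from math import ceil


def cycles_single(blocks, width):
    """One decoder with `width` slots. Slots after a taken branch are wasted,
    so each basic block starts on a fresh cycle."""
    return sum(ceil(n / width) for n in blocks)


def cycles_clustered(blocks, clusters, width):
    """`clusters` decoders of `width` slots each, decoding different basic
    blocks in parallel (round-robin assignment, no load balancing)."""
    per_cluster = [0] * clusters
    for i, n in enumerate(blocks):
        per_cluster[i % clusters] += ceil(n / width)
    return max(per_cluster)


branchy = [3, 2, 4, 3, 2, 3]  # short basic blocks, instruction counts
unrolled = [48]               # one long straight-line block

print(cycles_single(branchy, 8), cycles_clustered(branchy, 3, 3))    # 6 3
print(cycles_single(unrolled, 8), cycles_clustered(unrolled, 3, 3))  # 6 16
```

In this toy, clustering wins on branchy code and loses badly on the unrolled block. The second result is exactly the case that splitting a long block across clusters (load balancing) is meant to fix, which the model deliberately leaves out.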

4. "uops used by ARM are extremely close to the original instructions" It's the same on x86, micro-op count is nearly equal to instruction count. It's helpful to gather data to substantiate your conclusions. For example, on Zen 4 and libx264 video encoding, there's ~4.7% more micro-ops than instructions. Neoverse V2 retires ~19.3% more micro-ops than instructions in the same workload. Ofc it varies by workload. It's even possible to get negative micro-op expansion on both architectures if you hit branch fusion cases enough.

8. You also have to tell your ARM compiler which of the dozen or so ISA extension levels you want to target (see https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#inde...). It's not one option by any means. Not sure what you mean by "peephole heuristic optimizations", but people certainly micro-optimize for both arm and x86. For arm, see https://github.com/dotnet/runtime/pull/106191/files as an example. Of course optimizations will vary for different ISAs and microarchitectures. x86 is more widely used in performance critical applications and so there's been more research on optimizing for x86 architectures, but that doesn't mean Arm's cores won't benefit from similar optimization attention should they be pressed into a performance critical role.

> Not sure what you mean by "peephole heuristic optimizations"

Post-emit or within-emit stage optimization where a sequence of instructions is replaced with a more efficient shorter variant.

Think replacing pairs of ldr and str with ldp and stp, changing ldr and increment with ldr with post-index addressing mode, replacing address calculation before atomic load with atomic load with addressing mode (I think it was in ARMv8.3-a?).
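The ldr/ldr to ldp rewrite can be sketched as a find-and-replace pass over a list of instructions. This is a toy, not how any real compiler represents code: instructions are plain tuples, only 32-bit loads of the form ("ldr", dst, base, offset) are handled, and merge_ldr_pairs is a made-up name.

```python
def merge_ldr_pairs(instrs):
    """Replace two adjacent 32-bit loads from consecutive words with one ldp."""
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (
            nxt is not None
            and cur[0] == "ldr" and nxt[0] == "ldr"
            and cur[2] == nxt[2]          # same base register
            and nxt[3] == cur[3] + 4      # consecutive 32-bit words
            and cur[1] != nxt[1]          # distinct destinations
            and cur[1][1:] != cur[2][1:]  # first load must not overwrite the base
        ):
            out.append(("ldp", cur[1], nxt[1], cur[2], cur[3]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


before = [
    ("ldr", "w0", "x2", 0),
    ("ldr", "w1", "x2", 4),
    ("add", "w0", "w0", "w1"),
]
print(merge_ldr_pairs(before))
# [('ldp', 'w0', 'w1', 'x2', 0), ('add', 'w0', 'w0', 'w1')]
```

The pass only checks local legality. Whether the merged form is actually faster on a given core is a separate question that a pattern match like this cannot answer.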

The "heuristic" here might be possibly related to additional analysis when doing such optimizations.

For example, previously mentioned ldr, ldr -> ldp (or stp) optimization is not always a win. During work on .NET 9, there was a change[0] that improved load and store reordering to make it more likely that simple consecutive loads and stores are merged on ARM64. However, this change caused regressions in various hot paths because, for example, previously matched ldr w0, [addr], ldr w1, [addr+4] -> modify w0 -> str w0, [addr] pair got replaced with ldp w0, w1, [addr] -> modify w0 -> str w0, [addr].

Turns out this kind of merging defeated store forwarding on Firestorm (and newer) as well as other ARM cores. The regression was subsequently fixed[1], but I think the parent comment author may have had scenarios like these in mind.

[0]: https://github.com/dotnet/runtime/pull/92768

[1]: https://github.com/dotnet/runtime/pull/105695

1. Why would you WANT to hit 5+GHz when the downsides of exponential power take over? High clocks aren't a feature -- they are a cope.

AMD/Intel maintain I-cache and maintain a uop cache kept in sync. Using a tiny part to pre-decode is different from a massive uop cache working as far in advance as possible in the hopes that your loops will keep you busy enough that your tiny 4-wide decoder doesn't become overwhelmed.

2. The float workload was always BS because you can't run nothing but floats. The integer workload had 22.1 W total core power and 4.8 W power for the decoder. 4.8/22.1 is 21.7%. Even the 1.8 W float case is 8% of total core power. The only other argument would be that the study is wrong and 4.8 W isn't actually just decoder power.

3. We're talking about worst cases here. Nothing stops ARM cores from creating a "work pool" of upcoming branches in priority order for them to decode if they run out of stuff on the main branch. This is the best of both worlds where you can be faster on the main branch AND still do the same branchy code trick too.

4. This is the tail wagging the dog (and something else if your numbers are correct). Complex x86 instructions have garbage performance, so they are avoided by the compiler. The problem is that you can't GUARANTEE those instructions will NEVER be used, so the mere specter of them forces complex algorithms all over the place where ARM can do more simple things.

In any case, your numbers raise a VERY interesting question about x86 being RISC under the hood.

Consider this. Say that we have 1024 bytes of ARM code (256 instructions). x86 is around 15% smaller (about 870 bytes) and with the longer 4.25 byte instruction average, x86 should have around 205 instructions. If ARM is generating 19.3% more uops than instructions, we have about 305 uops. x86 with just 4.7% more has about 215 uops (the difference is way outside any margin of error).

If both are doing the same work, x86 uops must be in the range of 30% more complex. Given the limits of what an ALU can accomplish, we can say with certainty that x86 uops are doing SOMETHING that isn't the RISC they claim to be doing. Perhaps one could claim that x86 is doing some more sophisticated instructions in hardware, but that's a claim that would need to be substantiated (I don't know what ISA instructions you have that give a 15% advantage being done in hardware, but aren't already in the ARM ISA and I don't see ARM refusing to add circuitry for current instructions to the ALU if it could reduce uops by 15% either).
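For anyone who wants to check the arithmetic, the estimate can be reproduced in a few lines. The inputs are the figures quoted in this thread (15% smaller code, 4.25-byte average x86 instruction, 19.3% and 4.7% uop expansion); the function name is made up, and the result is only as good as those inputs.

```python
def uop_estimate(arm_bytes=1024):
    """Back-of-the-envelope uop counts from the figures quoted in the thread."""
    arm_instrs = arm_bytes / 4             # fixed 4-byte instructions
    x86_bytes = arm_bytes * (1 - 0.15)     # "around 15% smaller"
    x86_instrs = x86_bytes / 4.25          # quoted average instruction length
    arm_uops = arm_instrs * 1.193          # 19.3% more uops than instructions
    x86_uops = x86_instrs * 1.047          # 4.7% more uops than instructions
    return {
        "arm_uops": arm_uops,
        "x86_uops": x86_uops,
        "arm_over_x86": arm_uops / x86_uops,
        "x86_fewer_fraction": 1 - x86_uops / arm_uops,
    }


for name, value in uop_estimate().items():
    print(f"{name}: {value:.3f}")
```

Without intermediate rounding this gives roughly 305 ARM uops and 214 x86 uops. Put one way, x86 issues about 30% fewer uops; put the other way, ARM issues about 42% more.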

8. https://en.wikipedia.org/wiki/Peephole_optimization

The final optimization stage is basically heuristic find & replace. There could in theory be a mathematically provable "best instruction selection", but finding it would require trying every possible combination which isn't possible as long as P≠NP holds true.

My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though. Instead, you take your basic, short instruction and pad it with useless bytes. Add a couple useless bytes to a bunch of instructions and you now have the right length to push the function over to the cache boundary without adding any NOPs.

But the issues go deeper. When do you use a REX prefix? You may want it so you can use 16 registers, but it also increases code size. REX2 with APX is going to increase this issue further where you must juggle when to use 8, 16, or 32 registers and when you should prefer the long REX2 because it has 3-register instructions. All kinds of weird tradeoffs exist throughout the system. Because the compilers optimize for the CPU and the CPU optimizes for the compiler, you can wind up in very weird places.

In an ISA like ARM, there isn't any code density weirdness to consider. In fact, there's very little weirdness at all. Write it the intuitive way and you're pretty much guaranteed to get good performance. Total time to work on the compiler is a zero-sum game given the limited number of experts. If you have to deal with these kinds of heuristic headaches, there's something else you can't be working on.

> My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though.

I'd call that more neat than absurd.

> You may want it so you can use 16 registers, but it also increases code size.

RISC-V has the exact same issue, some compressed instructions having only 3 bits for operand registers. And on x86 for 64-bit-operand instructions you need the REX prefix always anyways. And it's not that hard to pretty reasonably solve - just assign registers by their use count.
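The "assign registers by their use count" idea can be sketched like this. It is a toy cost model, not a real allocator: every use of a register numbered at or above cheap_regs is counted as one expensive encoding, which ignores that a real prefix is paid per instruction and ignores live ranges entirely. All names here are hypothetical.

```python
from collections import Counter


def assign_by_use_count(uses):
    """Give the most frequently used virtual registers the lowest numbers."""
    ranked = [v for v, _ in Counter(uses).most_common()]
    return {v: phys for phys, v in enumerate(ranked)}


def assign_by_first_use(uses):
    """Baseline: number virtual registers in order of first appearance."""
    assignment = {}
    for v in uses:
        assignment.setdefault(v, len(assignment))
    return assignment


def expensive_uses(uses, assignment, cheap_regs):
    """Count uses that land in registers needing the longer encoding."""
    return sum(1 for v in uses if assignment[v] >= cheap_regs)


uses = ["a", "b", "a", "c", "a", "c", "a", "c", "a"]  # a: 5, c: 3, b: 1
print(expensive_uses(uses, assign_by_first_use(uses), cheap_regs=2))  # 3
print(expensive_uses(uses, assign_by_use_count(uses), cheap_regs=2))  # 1
```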

Peephole optimizations specifically here are basically irrelevant. Much of the complexity for x86 comes from just register allocation around destructive operations (though, that said, that does have rather wide-ranging implications). Other than that, there's really not much difference; all have the same general problems of moving instructions together for fusing, reordering to reduce register pressure vs putting parallelizable instructions nearer, rotating loops to reduce branches, branches vs branchless.

1. Performance. Also Arm implemented instruction cache coherency too.

Predecode/uop cache are both means to the same end, mitigating decode power. AMD and Intel have used both (though not on the same core). Arm has used both, including both on the same core for quite a few generations.

And a uop cache is just a cache. It's also big enough on current generations to cache more than just loops, to the point where it covers a majority of the instruction stream. Not sure where the misunderstanding of the uop cache "working as far in advance as possible" comes from. Unless you're talking about the BPU running ahead and prefetching into it? Which it does for L1i, and L2 as well?

2. "you can't run nothing but floats" they didn't do that in the paper, they did D += A[j] + B[j] ∗ C[j]. Something like matrix multiplication comes to mind, and that's not exactly a rare workload considering some ML stuff these days.

But also, has a study been done on Arm cores? For all we know they could spend similar power budgets on decode, or more. I could say an Arm core uses 99% of its power budget on decode, and be just as right as you are (they probably don't, my point is you don't have concrete data on both Arm and x86 decode power, which would be necessary for a productive discussion on the subject)

3. You're describing letting the BPU run ahead, which everyone has been doing for the past 15 years or so. Losing fetch bandwidth past a taken branch is a different thing.

4. Not sure where you're going. You started by suggesting Arm has less micro-op expansion than x86, and I provided a counterexample. Now you're talking about avoiding complex instructions, which a) compilers do on both architectures, they'll avoid stuff like division, and b) humans don't in cases where complex instructions are beneficial, see Linux kernel using rep movsb (https://github.com/torvalds/linux/blob/5189dafa4cf950e675f02...), and Arm introducing similar complex instructions (https://community.arm.com/arm-community-blogs/b/architecture...)

Also "complex" x86 instructions aren't avoided in the video encoding workload. On x86 it takes ~16.5T instructions to finish the workload, and ~19.9T on Arm (and ~23.8T micro-ops on Neoverse V2). If "complex" means more work per instruction, then x86 used more complex instructions, right?

8. You can use a variable length NOP on x86, or multiple NOPs on Arm to align function calls to cacheline boundaries. What's the difference? Isn't the latter worse if you need to move by more than 4 bytes, since you have multiple NOPs (and thus multiple uops, which you think is the case but isn't always true, as some x86 and some Arm CPUs can fuse NOP pairs)?
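The padding arithmetic is simple enough to sketch. The 64-byte cacheline and the 8-byte maximum NOP length used for the variable-length case are assumptions picked for the example, and both function names are made up.

```python
def padding_bytes(offset, alignment=64):
    """Bytes needed to move `offset` up to the next multiple of `alignment`."""
    return (-offset) % alignment


def nop_count(pad, max_nop_len):
    """Fewest NOPs that cover `pad` bytes if one NOP spans up to max_nop_len bytes."""
    return -(-pad // max_nop_len)  # ceiling division


pad = padding_bytes(0x1234)
print(pad)                # 12
print(nop_count(pad, 4))  # 3  (fixed 4-byte NOPs)
print(nop_count(pad, 8))  # 2  (variable-length NOPs, assumed 8-byte maximum)
```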

But seriously, do try gathering some data to see if cacheline alignment matters. A lot of x86/Arm cores that do micro-op caching don't seem to care if a function (or branch target) is aligned to the start of a cacheline. Golden Cove's return predictor does appear to track targets at cacheline granularity, but that's a special case. Earlier Intel and pretty much all AMD cores don't seem to care, nor do the Arm ones I've tested.

Anyway, you're making a lot of unsubstantiated guesses on "weirdness" without anything to suggest it has any effect. I don't think this is the right approach. Instead of "tail wagging the dog" or whatever, I suggest a data-based approach where you conduct experiments on some x86/Arm CPUs, and analyze some x86/Arm programs. I guess the analogy is, tell the dog to do something and see how it behaves? Then draw conclusions off that?

That was true when ARM was first released, but over the years the decoder for ARM has gotten more and more complicated. Who would have guessed adding more specialized instructions would result in more complicated decoders? ARM now uses multi-stage decoders, just the same as x86.
Sure, but it's not idle power consumption that's the difference between these.
When a laptop gets 12 hours or more of battery life that's because it's 90% idle.
And while it's important to design a chip that can enter a deep idle state, the thing that differentiates one Windows laptop from the next is how many mistakes the BIOS writers made and whether the platform drivers work correctly. This is also why you cannot really judge the expected battery life under Linux by reading reviews of laptops running Windows.