| 1. Performance. Also Arm implemented instruction cache coherency too. Predecode/uop cache are both means to the same end, mitigating decode power. AMD and Intel have used both (though not on the same core). Arm has used both, including both on the same core for quite a few generations. And a uop cache is just a cache. It's also big enough on current generations to cache more than just loops, to the point where it covers a majority of the instruction stream. Not sure where the misunderstanding of the uop cache "working as far in advance is possible" comes from. Unless you're talking about the BPU running ahead and prefetching into it? Which it does for L1i, and L2 as well? 2. "you can't run nothing but floats" they didn't do that in the paper, they did D += A[j] + B[j] ∗ C[j]. Something like matrix multiplication comes to mind, and that's not exactly a rare workload considering some ML stuff these days. But also, has a study been done on Arm cores? For all we know they could spend similar power budgets on decode, or more. I could say an Arm core uses 99% of its power budget on decode, and be just as right as you are (they probably don't, my point is you don't have concrete data on both Arm and x86 decode power, which would be necessary for a productive discussion on the subject) 3. You're describing letting the BPU run ahead, which everyone has been doing for the past 15 years or so. Losing fetch bandwidth past a taken branch is a different thing. 4. Not sure where you're going. You started by suggesting Arm has less micro-op expansion than x86, and I provided a counterexample. Now you're talking about avoiding complex instructions, which a) compilers do on both architectures, they'll avoid stuff like division, and b) humans don't in cases where complex instructions are beneficial, see Linux kernel using rep movsb (https://github.com/torvalds/linux/blob/5189dafa4cf950e675f02...), and Arm introducing similar complex instructions (https://community.arm.com/arm-community-blogs/b/architecture...) Also "complex" x86 instructions aren't avoided in the video encoding workload. On x86 it takes ~16.5T instructions to finish the workload, and ~19.9T on Arm (and ~23.8T micro-ops on Neoverse V2). If "complex" means more work per instruction, then x86 used more complex instructions, right? 8. You can use a variable length NOP on x86, or multiple NOPs on Arm to align function calls to cacheline boundaries. What's the difference? Isn't the latter worse if you need to move by more than 4 bytes, since you have multiple NOPs (and thus multiple uops, which you think is the case but isn't always true, as some x86 and some Arm CPUs can fuse NOP pairs) But seriously, do try gathering some data to see if cacheline alignment matters. A lot of x86/Arm cores that do micro-op caching don't seem to care if a function (or branch target) is aligned to the start of a cacheline. Golden Cove's return predictor does appear to track targets at cacheline granularity, but that's a special case. Earlier Intel and pretty much all AMD cores don't seem to care, nor do the Arm ones I've tested. Anyway, you're making a lot of unsubstantiated guesses on "weirdness" without anything to suggest it has any effect. I don't think this is the right approach. Instead of "tail wagging the dog" or whatever, I suggest a data-based approach where you conduct experiments on some x86/Arm CPUs, and analyze some x86/Arm programs. I guess the analogy is, tell the dog to do something and see how it behaves? Then draw conclusions off that? |
L1 cache is "free" in that you can fill it with simple data moves. uop cache requires actual work to decode and store elements for use in addition to moving the data. As to working ahead, you already covered this yourself. If you have a nearly 1-to-1 instruction-to-uop ratio, having just 4 decoders (eg, zen4) is a problem because you can execute a lot more than just 4 instructions on the backend. 6-wide Zen4 means you use 50% more instructions than you decode per clock. You make up for this in loops, but that means while you're executing your current loop, you must be maxing out the decoders to speculatively fill the rest of the uop cache before the loop finishes. If the loop finishes and you don't have the next bunch of instructions decoded, you have a multi-cycle delay coming down the pipeline.
2. I'd LOVE to see a similar study of current ARM chips, but I think the answer here is pretty simple to deduce. ARM's slide says "4x smaller decoders vs A710" despite adding a 5th decoder. They claim 20% reduction in power at the same performance and the biggest change is the decoder. As x86 decode is absolutely more complex than aarch32, we can only deduce that switching from x86 to aarch64 would be an even more massive reduction. If we assume an identical 75% reduction in decoder power, we'd move from 4.8w on haswell the decoder down to 1.2w reducing total core power from 22.1 to 18.5 or a ~16% overall reduction in power. This isn't too far from to the power numbers claimed by ARM.
4. This was a tangent. I was talking about uops rather than the ISA. Intel claims to be simple RISC internally just like ARM, but if Intel is using nearly 30% fewer uops to do the same work, their "RISC" backend is way more complex than they're admitting.
8. I believe aligning functions to cacheline boundaries is a default flag at higher optimization levels. I'm pretty sure that they did the analysis before enabling this by default. x86 NOP flexibility is superior to ARM (as is its ability to avoid them entirely), but the cause is the weirdness of the x86 ISA and I think it's an overall net negative.
Loads of x86 instructions are microcode only. Use one and it'll be thousands of cycles. They remain in microcode because nobody uses them, so why even try to optimize and they aren't used because they are dog slow. How would you collect data about this? Nothing will ever change unless someone pours in millions of dollars in man-hours into attempting to speed it up, but why would anyone want to do that?
Optimizing for a local maxima rather than a global maxima happens all over technology and it happens exactly because of the data-driven approach you are talking about. Look for the hot code and optimize it without regard that there may be a better architecture you could be using instead. Many successes relied on an intuitive hunch.
ISA history has a ton of examples. iAPX432 super-CISC, the RISC movement, branch delay slots, register windows, EPIC/VLIW, Bulldozer's CMT, or even the Mill design. All of these were attempts to find new maxima with greater or lesser degrees of success. When you look into these, pretty much NONE of them had any real data to drive them because there wasn't any data until they'd actually started work.