by sm_1024 677 days ago

IMO, the most interesting thing about this line is the battery life: within an hour of MBP3 and within 2 hours of Asus's Qualcomm, making it comparable to ARM architectures [1].

Which is a little surprising because ARM is commonly believed to be much more power efficient than x86.

[1] https://youtu.be/Z8WKR0VHfJw?si=A7zbFY2lsDa8iVQN&t=277

11 comments

arnaudsm 677 days ago

ARM got a lot of hype since the release of the M1, but most users only compared it to the terrible Intel MBPs. Ryzen mobile has been consistently close to Apple silicon perf/watt for 5 years, but got little press coverage.

Hype can be really decorrelated from real world performance.

jsheard 677 days ago

Any efficiency comparison involving Apple's chips also has to factor in that Tim Cook keeps showing up at TSMC's door with a freight container full of cash to buy out exclusive access to their bleeding edge silicon processes. ARM may be a factor but don't underestimate the power of having more money than God.

Case in point, Strix Point is built on TSMC 4nm while Apple is already using TSMC's second-generation 3nm process.

hajile 677 days ago

Let's do the math on M1 Pro (10-core, N5, 2021) vs HX370 (12-core, N4P, 2024).

Firestorm without L3 is 2.281 mm². Icestorm is 0.59 mm². M1 Pro has 8P+2E for a total of 19.428 mm² of cores included.

Zen4 without L3 is 3.84 mm². Zen4c reduces that down to 2.48 mm². Zen5 CCD is pretty much the same size as Zen4 (though with 27% more transistors), so core size should be similar. AMD has also stated that Zen5c has a similar shrink percent to Zen4c. We'll use their numbers. HX370 has 4P+8C for a total area of 35.2 mm². If being twice the size despite being on N4P instead of N5 like M1 seems like foreshadowing, it is.

We'll use notebookcheck's Cinebench 2024 multithread power and performance numbers to calculate perf / power / area then multiply that by 100 to eliminate some decimals.

M1 Pro scores 824 (10-core) and while they don't have a power value listed, they do list 33.6w package power running the prime95 power virus, so cinebench's power should be lower than that.

HX370 scored 1213 (12-core) and averaged 119w (maxing at a massive 121.7w and that's without running a power virus).

This gives the following perf/power/area*100 scores:

M1 Pro: 126 PPA

HX 370: 29 PPA

M1 is more than 4.3x better while being an entire node behind and being released years before.
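
The arithmetic above can be reproduced with a short script. This is a minimal sketch that uses only the figures quoted in this comment; the helper names (`core_area`, `ppa`) are made up for illustration, and the M1 Pro power value is the 33.6 W prime95 package power, used as an upper bound as described above.

```python
# Reproduces the perf / power / area * 100 figure from the numbers quoted above.
# All inputs come from the comment; the function names are illustrative only.

def core_area(big_count, big_mm2, small_count, small_mm2):
    """Total core area in mm^2, excluding L3."""
    return big_count * big_mm2 + small_count * small_mm2

def ppa(score, watts, area_mm2):
    """Cinebench 2024 MT score per watt per mm^2, scaled by 100."""
    return score / watts / area_mm2 * 100

m1_pro_area = core_area(8, 2.281, 2, 0.59)  # 8 Firestorm + 2 Icestorm
hx370_area = core_area(4, 3.84, 8, 2.48)    # 4P + 8C, assuming Zen4/Zen4c core sizes

m1_pro_ppa = ppa(824, 33.6, m1_pro_area)
hx370_ppa = ppa(1213, 119, hx370_area)

print(f"M1 Pro: {m1_pro_area:.3f} mm2, {m1_pro_ppa:.0f} PPA")
print(f"HX 370: {hx370_area:.3f} mm2, {hx370_ppa:.0f} PPA")
print(f"ratio:  {m1_pro_ppa / hx370_ppa:.2f}x")
```

The score is inversely proportional to the power figure fed in, so the result is only as trustworthy as that measurement.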

sm_1024 677 days ago

119W for hx370 looks extremely sus, seems to me more like the system level power consumption and not CPU-only.

According to phoronix [1,2], in their blender CPU test, they measured a peak of 33W.

Here are max power numbers from some other tests that I know are multi-threaded:

--

Linux 6.8 Compilation: 33.13 W

LLVM Compilation: 33.25 W

--

If I plug 33W into your equation, that would give the HX 370 a score of 104 PPA.

This supports the HX 370 being pretty power efficient, although still not as power efficient as M3.

[1] https://www.phoronix.com/review/amd-ryzen-ai-9-hx-370/3

[2] https://www.phoronix.com/review/amd-ryzen-ai-9-hx-370/4

hajile 677 days ago

https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...

They got those kinds of numbers across multiple systems. You can take it up with them I guess.

I didn't even mention one of these systems was peaking at 59w on single-core workloads.

sm_1024 677 days ago

I see what's going on, they have two HX370 laptops:

  Laptop  MC score  Avg Power
     P16      1213      113 W
     S16       921       29 W
  M3 Pro      1059    (30 W?)

They don't have M3 Pro power numbers, but I assume it is somewhere around 30W. So it seems like the HX 370 at around 30 W (the S16) has similar power efficiency to the M3 Pro.

Any more power, and the CPU is much less power efficient: roughly a 300% increase in power for a 30% increase in performance.
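
To make the scaling concrete, here is a rough sketch that computes points per watt from the table above. The 30 W for the M3 Pro is the guess stated above, not a measurement, and the variable names are illustrative.

```python
# Points per watt from the table above (Cinebench 2024 MT score, average power in W).
# The M3 Pro power value is an assumed ~30 W, not a measured number.
laptops = {
    "P16 (HX 370)": (1213, 113),
    "S16 (HX 370)": (921, 29),
    "M3 Pro": (1059, 30),  # assumption
}

points_per_watt = {name: score / watts for name, (score, watts) in laptops.items()}
for name, ppw in points_per_watt.items():
    print(f"{name:>13}: {ppw:5.1f} points/W")

# Going from the S16 to the P16 configuration of the same chip:
extra_power = 113 / 29 - 1   # fractional increase in power
extra_perf = 1213 / 921 - 1  # fractional increase in score
print(f"power +{extra_power:.0%}, performance +{extra_perf:.0%}")
```

With these inputs the power increase works out to roughly 290% and the performance increase to roughly 32%, in line with the rounded figures above.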

moonfern 676 days ago

About cinebench-geekbench-spec: https://old.reddit.com/r/hardware/comments/pitid6/eli5_why_d... That's about Cinebench 20, an overview of Cinebench 24 cpu&gpu(!): https://www.cgdirector.com/cinebench-2024-scores/

jeswin 676 days ago

Even with the M3 the difference is marginal in multi-threaded benchmarks, from the Cinebench link [1] someone posted earlier on the thread.

    Apple M3 Pro 11-Core - 394 Points per Watt
    AMD Ryzen AI 9 HX 370 - 354 Points per Watt
    Apple M3 Max 16-Core - 306 Points per Watt

And the Ryzen is on TSMC 4nm while the M3 is on 3nm. As parent is saying, a lot of the Apple Silicon hype was due to the massive upgrade it was over the Intel CPUs Apple was using previously.

[1]: https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...

dagmx 676 days ago

Their efficiency tests use Cinebench R23 (as called out explicitly).

R23 is not optimized for Apple silicon but is for x86. The R24 numbers are actually what you need for a fair comparison, otherwise you put the Arm numbers at a significant handicap.

merb 673 days ago

That the max should be worse than the m3 pro is a little bit shady.

janwas 676 days ago

Cinebench might not be the most relevant benchmark, it uses lots of scalar instructions with fairly high branch mispredictions and low IPC: https://chipsandcheese.com/2021/02/22/analyzing-zen-2s-cineb....

cyp0633 677 days ago

Power efficiency is a curve, and Apple may have its own reason not to make M1 Pro run at 110W as well

sm_1024 677 days ago

I think the OC might have misread the power numbers; 110 W is well into desktop CPU power range. Here is an excerpt from AnandTech:

> In our peak power test, the Ryzen AI 9 HX 370 ramped up and peaked at 33 W.

https://www.anandtech.com/show/21485/the-amd-ryzen-ai-hx-370...

hajile 677 days ago

You can read the notebookcheck review for yourself.

https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...

hajile 677 days ago

I stacked the deck in AMD's favor using a 3-year-old chip on an older node.

Why is AMD using 3.6x more power than M1 to get just 32% higher performance while having 17% more cores? Why are AMD's cores nearly 2x the size despite being on a better node and having 3 more years to work on them?

Why are Apple's scores the same on battery while AMD's scores drop dramatically?

Apple does have a reason not to run at 120w -- it doesn't need to.

Meanwhile, if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad.

pickledish 677 days ago

You should try not to talk so confidently about things you don't know about -- this statement

> if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad

is completely incorrect, as another commenter (and I think the notebookcheck article?) points out -- 30w is about the sweet spot for these processors, and the reason that 110w laptop seems so inefficient is that it's giving the APU 80w of TDP, which is a bit silly since it only performs marginally better than if you gave it e.g. 30 watts. It's not a good idea to take that example as a benchmark for the APU's efficiency: it varies depending on how much TDP you give the processor, and 80w is not a good TDP for these.

Const-me 676 days ago

> if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad

I’m writing this comment on HP ProBook 445 G8 laptop. I believe I bought it in early 2022, so it's a relatively old model. The laptop has a Ryzen 5 5600U processor which uses ≤ 25W. I’m quite happy with both the performance and battery life.

atq2119 676 days ago

It's well known that performance doesn't scale linearly with power.

Benchmarking incentives on PC have long pushed x86 vendors to drive their CPUs at points of the power/performance curve that make their chips look less efficient than they really are. Laptop benchmarking has inherited that culture from desktop PC benchmarking to some extent. This is slowly changing, but Apple has never been subject to the same benchmarking pressures in the first place.

You'll see in reviews that Zen5 can be very efficient when operated in the right power range.

AnthonyMouse 676 days ago

> I stacked the deck in AMD's favor using a 3-year-old chip on an older node.

You could just compare the ones that are actually on the same process node:

https://www.notebookcheck.net/R9-7945HX3D-vs-M2-Max_15073_14...

But then you would see an AMD CPU with a lower TDP getting higher benchmark results.

> Why is AMD using 3.6x more power than M1 to get just 32% higher performance while having 17% more cores?

Getting 32% higher performance from 17% more cores implies higher performance per core.

The power measurements that site uses are from the plug, which is highly variable to the point of uselessness because it takes into account every other component the OEM puts into the machine and random other factors like screen brightness, thermal solution and temperature targets (which affects fan speed which affects fan power consumption) etc. If you measure the wall power of a system with a discrete GPU that by itself has a TDP >100W and the system is drawing >100W, this tells you nothing about the efficiency of the CPU.

AMD's CPUs have internal power monitors and configurable power targets. At full load there is very little light between the configured TDP and what they actually use. This is basically required because the CPU has to be able to operate in a system that can't dissipate more heat than that, or one that can't supply more power.

> Meanwhile, if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad.

33W is approximately what their mobile CPUs actually use. Also, even lower-configured TDP models exist and they're not that much slower, e.g. the 7840U has a base TDP of 15W vs. 35W for the 7840HS and the difference is a base clock of 3.3GHz instead of 3.8GHz.

acdha 677 days ago

Process helps but have you seen benchmarks showing equivalent performance between the same process node? I think it’s less that ARM is amazing than the Apple Silicon team being very good and paired with aggressive optimization throughout the stack but everything I’ve seen suggests they are simply building better chips at their target levels (not server, high power, etc.).

cubefox 677 days ago

> Our benchmark database shows the Dimensity 9300 scores 2,207 and 7,408 in Geekbench 6.2's single and multi-core tests. A 30% performance improvement implies the Dimensity 9400 would score around 2,869 and 9,630. Its single-core performance is close to that of the Snapdragon 8 Gen 4 (2,884/8,840) and it understandably takes the lead in multi-core. Both are within spitting distance from the Apple A17 Pro, which scores 2,915 and 7,222 points in the benchmark. Then again, all three chips are said to be manufactured on TSMC's N3 class node, effectively leveling the playing field.

https://www.notebookcheck.net/MediaTek-Dimensity-9400-rumour...

acdha 677 days ago

That appears to be an unconfirmed rumor and it’s exciting if true (and there aren’t major caveats on power), but did you notice how they mentioned extra work by ARM? The argument isn’t that Apple is unique, it’s that the performance gaps they’ve shown are more than simply buying premium fab capacity.

That doesn’t mean other designers can’t also do that work, but simply that it’s more than just the process - for example, the M2 shipped on TSMC’s N5P first as an exclusive but when Zen 5 shipped later on the same process it didn’t close the single core performance or perf/watt gap. Some of that is x86 vs. ARM but there isn’t a single, simple factor which can explain this - e.g. Apple carefully tuning the hardware, firmware, OS, compilers, and libraries undoubtedly helps a lot too, and it’s been a perennial problem for non-Intel vendors on the PC side since so many developers have tuned for Intel first/only for decades.

AnthonyMouse 676 days ago

> for example, the M2 shipped on TSMC’s N5P first as an exclusive but when Zen 5 shipped later on the same process it didn’t close the single core performance or perf/watt gap.

That was Zen 4, but it did close the gap:

https://www.notebookcheck.net/R9-7945HX3D-vs-M2-Max_15073_14...

Single thread performance is higher (so is MT), TDP is slightly lower, Cinebench MT "points per watt" is 5% higher.

We'll get to see it again when the 3nm version of Zen5 is released (the initial ones are 4nm, which is a node Apple didn't use).

cubefox 677 days ago

Since it's unclear whether Apple has a significant architectural advantage over Qualcomm and MediaTek, I would rather attribute this to relatively poor AMD architectures. Provisionally. At least their GPUs have been behind Nvidia for years. (AMD holding its own against Intel is not surprising given Intel's chip fab problems.)

sroussey 677 days ago

I guess getting close to the same single thread score is nice. Unfortunately, since only Apple is shipping it is hard to compare if the others burn the battery to get there.

I suspect the other two, like Apple with the A18 shipping next month, will be using the second gen N3. Apple is expected to be around 3500 on that node.

Needless to say, what will be very interesting is to see the perf/watt of all three on the same node and shipping in actual products where the benchmarks can be put to more useful tests.

cubefox 677 days ago

Yeah, and GPU tests, since the benchmarks above were only for the CPU.

carstenhag 677 days ago

But how is this the case? I never saw a single article mentioning that a non-Mac laptop was better.

(Random article saying M3 pro is better than a Dell laptop https://www.tomsguide.com/news/macbook-pro-m3-and-m3-max-bat... )

moonfern 677 days ago

You're right, but... The idea comes from the desktop world. AMD's Zen 4 desktop CPUs, especially the gaming variants like the Ryzen 7 7800X3D, almost match the performance per watt of Apple's M3.

Their laptop CPUs, judging by cases where a company released the same model with different CPUs, were less efficient than Intel's.

But the Asus ProArt P16 (used in the article) did manage an extreme endurance score of 21 hours in the video test, Big Buck Bunny H.264 1080p, which runs at 150 cd/m². With its higher resolution, OLED and 10% less battery capacity, that's 40 minutes better than the MacBook Pro 16 M3 Max. In the wifi test, also run at 150 cd/m², the M3 ran for 16 hours, the Asus 8. ( https://www.notebookcheck.net/Asus-ProArt-P16-laptop-review-... )

For me noise matters: that Asus has a whisper mode which produces 42 dB, as much as an M3 Max under full load. Please be aware that if you're susceptible to PWM, that ASUS laptop has issues.

sm_1024 677 days ago

I have heard that part of the reason for little coverage of ryzen mobile CPUs is their limited availability as AMD was focussing on using the fab capacity for server chips.

sandywaffles 676 days ago

I think that's because all the press talks about actual battery life per laptop, and the Apple Silicon laptops ship with literally double the size battery of any AMD-based laptop without a discrete GPU. So while the efficiency may be close, the actual perceived battery life of the Mac will be more than double when you also consider the priority Apple puts into their power control combined with a larger overall battery.

Filligree 677 days ago

Ryzen mobile is consistently close, yeah. But with the sole exception of the Steam deck, I've yet to see a Ryzen mobile-bearing laptop, Windows included, which is close to the overall performance of the Macbook.

makeitdouble 677 days ago

"overall performance" does a lot of work here. On sheer benchmarks it's really comparable, with AMD being slightly better depending on what you look at. e.g. the M1 vs the 5700U (a similar class widely available mobile CPU):

https://www.cpubenchmark.net/cpu.php?cpu=AMD%20Ryzen%207%205...

https://www.cpubenchmark.net/cpu.php?cpu=Apple+M1+8+Core+320...

They're not profiled the same, and don't belong in the same ecosystem though, which makes a lot more difference than the CPUs themselves. In particular the AMD doesn't get a dedicated compiler optimizing every application of the system to its strengths and weaknesses (the other side of it being the compatibility with the two vastest ecosystems we have now).

sofixa 676 days ago

Depends on what you mean by "overall performance", but my Asus ROG Zephyrus G14 2023 is full AMD, and outperforms my work issued top of the line M1 MacBook Pro from a few months earlier in every task I've done across the two (gaming, compiling, heavy browsing). Battery life is lower under heavy load and high performance on the Zephyrus, but in power saving mode it's roughly comparable, albeit still worse.

izacus 676 days ago

Same here, my G14 and the M1 MBP are pretty much interchangeable for most workloads. The only time the G14 starts its fans is when the 4070 turns on... and that's not an option on the M1 at all.

talldayo 677 days ago

> But with the sole exception of the Steam deck

Uuh wut? The Steam Deck is like 3-generation-old hardware in mobile Ryzen terms. In a lot of ways it's similar to a pared-back 4800u with fewer (and older) cores, and a slightly bumped up GPU.

To me it's kinda the opposite. Excluding the Steam Deck, I think most of AMD's Ultrabook APUs have been very close to the products Apple's made on the equivalent nodes. Even on 7nm the 4800u put up a competitive fight against M1, and the gap has gotten thinner with each passing year. According to the OpenCL benchmarks, the Radeon 680m on 6nm scores higher than the M1 on 5nm: https://browser.geekbench.com/opencl-benchmarks

Even back when Ryzen Mobile only shipped with Vega, it was pretty clear that Apple and AMD were a pretty close match in onboard GPU power.

amlib 677 days ago

Steam Deck might be behind in terms of hardware but in terms of software it's way beyond your typical x86 Linux system's power efficiency, and dare I say it's doing better than Windows machines with the typical shoddy BIOSes and drivers, especially when you consider all the extraneous services constantly sapping varying amounts of CPU time. All that contributes to making the SD punch well above its weight.

sudosysgen 676 days ago

My Alienware M15 Ryzen edition gets 7-8W power consumption by just running "sudo powertop --autotune". Basically all of the power efficiency stuff in the Steam Deck applies to other Ryzen systems and is in the mainline kernel.

dagmx 677 days ago

Battery tests are important, but so is how it fares on battery (what is the performance drop-off to maintain that), what its performance is at its peak, and how long before it throttles when pushed.

The M series processors have succeeded in all four: battery life, performance parity between battery and plugged in, high performance and performance sustainability.

So far, very few benchmarks have been comparing the latter three as part of the full package assessment.

jiggawatts 677 days ago

> because ARM is commonly believed to be much more power efficient than x86.

Because most ARM processors were designed for mobile phones and optimised to death for power efficiency.

The total power usage of the front end decoders is a single digit percentage of the total power draw. Even if ARM magically needed 0 watts for this, it couldn’t save more power than that. The rest of the processor design elements are essentially identical.

Panzer04 677 days ago

>5hr battery life in laptops is mostly a function of how well idle is managed, I think. The less work you can do while running the user's core program, the better. I'm not sure how much impact CPU efficiency really has in that case.

If you are running a remotely demanding program (say, a game) , your battery life will be bad no matter what (ie. <4hrs) unless you choose a very low TDP that performs badly always.

A laptop at idle should be able to manage ~5w power consumption regardless of AMD/Intel/Apple processor, but it's largely on the OS to achieve that.

999900000999 677 days ago

I have a 365 AMD laptop.

The battery is great if you're doing very light stuff; Call of Duty takes its battery down to 3 hours.

Macs don't really support higher end games, so I can't directly compare to my M1 Air.

sedatk 676 days ago

How does “great” translate to hours?

999900000999 676 days ago

This is really tricky.

The OEMs will use every trick possible and do something like open Gmail to claim 10 hours, but given my typical use I average 5 to 6. I make music using software called Maschine.

It's a massive step up over my old (still working, just very heavy) Lenovo Legion 2020, which would last about 2 hours given the same usage.

This is all subjective at the end of the day. If none of your applications actually work since you're on ARM Windows, of course you'll have higher battery life.

wtallis 677 days ago

The CPU core's instruction set has no influence on how well the chip as a whole manages power when not executing instructions.

sm_1024 677 days ago

That is fair, I was taught that decoders for x86 are less efficient and more power hungry than RISC ISAs because of their variable length instructions.

I remember being told (and it might be wrong) that ARM can decode multiple instructions in parallel because the CPU knows where the next instruction starts, but for x86, you'd have to decode the instructions in order.

pohuing 677 days ago

That seems to not matter much nowadays. There's another great (according to my untrained eye) writeup on its lack of importance on Chips and Cheese.

https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...

dzaima 677 days ago

The various mentioned power consumption amounts are 4-10% per-core, or 0.5-6% of package (with the caveat of running with micro-op cache off) for Zen 2, and 3-10% for Haswell. That's not massive, but is still far from what I'd consider insignificant; it could give leeway for an extra core or some improved ALUs; or, even, depending on the benchmark, is the difference between Zen 4 and Zen 5 (making the false assumption of a linear relation between power and performance, at least), which'd essentially be a "free" generational improvement. Of course the reality is gonna be more modest than that, but it's not nothing.

Panzer04 677 days ago

You missed the part where they mention ARM ends up implementing the same thing to go fast.

The point is processors are either slow and efficient, or fast and inefficient. It's just a tradeoff along the curve.

dzaima 677 days ago

ARM doesn't need the variable-length instruction decoding though, which on x86 essentially means that the decoder has to attempt to decode at every single byte offset for the start of the pipeline, wasting computation.

Indeed pretty much any architecture can benefit from some form of op cache, but less of a need for it means its size can be reduced (and savings spent in more useful ways), and you'll still need actual decoding at some point anyway (and, depending on the code footprint, may need it a lot).

More generally, throwing silicon at a problem is, quite obviously, a more expensive solution than not having the problem in the first place.

hajile 677 days ago

That stuff is WAY out-of-date and was flatly wrong when it was published.

A715 cut decoder size a whopping 75% by dropping the more CISC 32-bit stuff and completely eliminated the uop cache too. Losing all that decode, cache, and cache controllers means a big reduction in power consumption (decoders are basically always on). All of ARM's latest CPU designs have eliminated uop cache for this same reason.

At the time of publication, we already knew that M1 (already out for nearly a year) was the highest IPC chip ever made and did not use a uop cache.

hajile 677 days ago

Clam makes some serious technical mistakes in that article and some info is outdated.

1. His claim that "ARM decoder is complex too" was wrong at the time (M1 being an obvious example) and has been proven more wrong since publication. ARM dropped the uop cache as soon as they dropped support for their very CISC-y 32-bit catastrophe. They bragged that this coincided with a whopping 75% reduction in decoder size for their A715 (while INCREASING from 4 decoders to 5) and this was almost single-handedly responsible for the reduced power consumption of that chip (as all the other changes were comparatively minor). NONE of the current-gen cores from ARM, Apple, or Qualcomm use uop cache eliminating these power-hungry cache and cache controllers.

2. The paper[0] he quotes has a stupid conclusion. They show integer workloads using a massive 22% of total core power on the decoder and even their fake float workload showed 8% of total core power. Realize that a study[1] of the entire Ubuntu package repo showed that just 12 int/ALU instructions made up 89% of all code with float/SIMD being in the very low single digits of use.

3. x86 decoder situation has gotten worse. Because adding extra decoders is exponentially complex, they decided to spend massive amounts of transistors on multiple decoder blocks working on various speculated branches. Setting aside that this penalizes unrolled code (where they may have just 3-4 decoders while modern ARM will have 10+ decoders), the setup for this is incredibly complex and man-year intensive.

4. "ARM decodes into uops too" is a false equivalency. The uops used by ARM are extremely close to the original instructions as shown by them being able to easily eliminate the uop cache. x86 has a much harder job here mapping a small set of instructions onto a large set.

5. "ARM is bloated too". ARM redid their entire ISA to eliminate bloat. If ISA didn't actually matter, why would they do this?

6. "RISC-V will become bloated too" is an appeal to ignorance. x86 has SEVENTEEN major SIMD extensions excluding the dozen or so AVX-512 extensions all with various incompatibilities and issues. This is because nobody knew what SIMD should look like. We know now and RISC-V won't be making that mistake. x86 has useless stuff like BCD instructions using up precious small instruction space because they didn't know. RISC-V won't do this either. With 50+ years of figuring the basics out, RISC-V won't be making any major mistakes on the most important stuff.

7. Omitting complexity. A bloated, ancient codebase takes forever to do anything with. A bloated, ancient ISA takes forever to do anything with. If ARM and Intel both put X dollars into a new CPU design, Intel is going to spend 20-30% or maybe even more of their budget on devs spending time chasing edge cases and testers to test all those edge cases. Meanwhile, ARM is going to spend that 20-30% of their budget on increasing performance. All other things equal, the ARM chip will be better at any given design price point.

8. Compilers matter. Spitting out fast x86 code is incredibly hard because there are so many variations on how to do things each with their own tradeoffs (that conflate in weird ways with the tradeoffs of nearby instructions). We do peephole heuristic optimizations because provably fast would take centuries. RISC-V and ARM both make it far easier for compiler writers because there's usually just one option rather than many options and that one option is going to be fast.

[0] https://www.usenix.org/system/files/conference/cooldc16/cool...

[1] https://oscarlab.github.io/papers/instrpop-systor19.pdf

NobodyNada 676 days ago

One more: there's more to an ISA than just the instructions; there are semantic differences as well. x86 dates to a time before out-of-order execution, caches, and multi-core systems, so it has an extremely strict memory model that does not reflect modern hardware -- the only memory-reordering optimization permitted by the ISA is store buffering.

Modern x86 processors will actually perform speculative weak memory accesses in order to try to work around this memory model, flushing the pipeline if it turns out a memory-ordering guarantee was violated in a way that became visible to another core -- but this has complexity and performance impacts, especially when applications make heavy use of atomic operations and/or communication between threads.

Simple atomic operations can be an order of magnitude faster on ARMv8 vs x86: https://web.archive.org/web/20220129144454/https://twitter.c...

StillBored 675 days ago

"the only memory-reordering optimization permitted by the ISA is store buffering."

I think this is a mischaracterization of TSO. TSO only dictates the store ordering to other entities in the system, the individual cores are fully capable of using the results of stores that are not yet visible for their own OoO purposes as long as the dataflow dependencies are correctly solved. The complexities of the read/write bypassing is simply to clarify correct program order.

And this is why the TSO/non TSO mode on something like the apple cores doesn't seem to make a huge difference, particularly if one assumes that the core is aggressively optimized for the arm memory model, and the TSO buffering/ordering is not a critical optimization point.

Put another way, a core designed to track store ordering utilizing some kind of writeback merging is going to be fully capable of executing just as aggressively OoO and holding back or buffering the visibility of completed stores until earlier stores complete. In fact for multithreaded lock-free code the lack of explicit write fencing is likely a performance gain for very carefully optimized code in most cases. A core which can pipeline and execute multiple outstanding store fences is going to look very similar to one that implements TSO.

sroussey 676 days ago

Yes, and Apple added this memory model to their ARM implementation so Rosetta2 would work well.

dzaima 677 days ago

Some notes:

3: I don't think more decoders should be exponentially more complex, or even polynomial; I think O(n log n) should suffice. It just has a hilarious constant factor due to the lookup tables and logic needed, and that log factor also impacts the critical path length, i.e. pipeline length, i.e. mispredict penalty. Of note is that x86's variable-length instructions aren't even particularly good at code size.

Golden Cove (~1y after M1) has 6-wide decode, which is probably reasonably near M1's 8-wide given x86's complex instructions (mainly free single-use loads). [EDIT: actually, no, chipsandcheese's diagram shows it only moving 6 micro-ops per cycle to reorder buffer, even out of the micro-op cache. Despite having 8/cycle retire. Weird.]

6: The count of extensions is a very bad way to measure things; RISC-V will beat everything in that in no time, if not already. The main things that matter are ≤SSE4.2 (uses same instruction encoding as scalar code); AVX1/2 (VEX prefix); and AVX-512 (EVEX). The actual instruction opcodes are shared across those. But three encoding modes (plus the three different lengths of the legacy encoding) is still bad (and APX adds another two onto this) and the SSE-to-AVX transition thing is sad.

RISC-V already has two completely separate solutions for SIMD - v (aka RVV, i.e. the interesting scalable one) and p (a simpler thing that works in GPRs; largely not being worked on but there's still some activity). And if one wants to count extensions, there are already a dozen for RVV (never mind its embedded subsets) - Zvfh, Zvfhmin, Zvfbfwma, Zvfbfmin, Zvbb, Zvkb, Zvbc, Zvkg, Zvkned, Zvknhb, Zvknha, Zvksed, Zvksh; though, granted, those work better together than, say, SSE and AVX (but on x86 there's no reason to mix them anyway).

And RVV might get multiple instruction encoding forms too - the current 32-bit one is forced into allowing using only one register for masking due to lack of encoding space, and a potential 48-bit and/or 64-bit instruction encoding extension has been discussed quite a bit.

8: RISC-V RVV can be pretty problematic for some things if compiling without a specific target architecture, as the scalability means that different implementations can have good reason to have wildly different relative instruction performance (perhaps most significant being in-register gather (aka shuffle) vs arithmetic vs indexed load from memory).

hajile 677 days ago

3. You can look up the papers released in the late 90s on the topic. If it was O(n log n), going bigger than 4 full decoders would be pretty easy.

6. Not all of those SIMD sets are compatible with each other. Some (eg, SSE4a) wound up casualties of the Intel v AMD war. It's so bad that the Intel AVX10 proposal is mostly about trying to unify their latest stuff into something more cohesive. If you try to code this stuff by hand, it's an absolute mess.

The P proposal is basically DOA. It could happen, but nobody's interested at this point. Just like the B proposal subsumed a bunch of ridiculously small extensions, I expect a new V proposal to simply unify these. As you point out, there isn't really any conflict between these tiny instruction releases.

There is discussion around the 48-bit format (the bits have been reserved for years now), but there are a couple different proposals (personally, I think 64-bit only with the ability to put multiple instructions inside is better, but that's another topic). Most likely, a 48-bit format does NOT do multiple encoding, but instead does a superset of encodings (just like how every 16-bit instruction expands into a 32-bit instruction). They need/want 48-bits to allow 4-address instructions too, so I'd imagine it's coming sooner or later.

Either way, the length encoding is easy to work with compared to x86 where you must check half the bits in half the bytes before you can be sure about how long your instruction really is.
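To make the length-encoding point concrete, here is a minimal sketch of how a RISC-V instruction's length can be read off the low bits of its first 16-bit parcel. The function name `rv_insn_length` is made up for illustration. The 16-bit and 32-bit cases follow the ratified base encoding; the 48-bit and 64-bit prefixes follow the extended-length scheme sketched in the unprivileged spec, and are treated here as an assumption since those formats are not finalized.

```python
def rv_insn_length(first_halfword: int) -> int:
    """Return the instruction length in bytes, given its first 16-bit parcel.

    Hypothetical helper for illustration; only the low 7 bits are inspected.
    """
    if first_halfword & 0b11 != 0b11:
        return 2  # bits [1:0] != 11: compressed (C extension) instruction
    if first_halfword & 0b11100 != 0b11100:
        return 4  # bits [4:2] != 111: standard 32-bit instruction
    if first_halfword & 0b111111 == 0b011111:
        return 6  # bits [5:0] == 011111: 48-bit format (assumed, not finalized)
    if first_halfword & 0b1111111 == 0b0111111:
        return 8  # bits [6:0] == 0111111: 64-bit format (assumed, not finalized)
    raise ValueError("longer or reserved encoding")


if __name__ == "__main__":
    for parcel in (0x4501, 0x0513, 0x001F, 0x003F):
        print(hex(parcel), rv_insn_length(parcel))
```

At most seven bits of one halfword decide the length, which is the contrast being drawn with x86's prefixes, opcode maps, ModRM, and SIB bytes.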

8. There could be some variance, but x86 has this issue too and SO many more besides.

dzaima 676 days ago

Expanding on 3: I think it ends up at O(n^2 * log n) transistors, O(log n) critical path (not sure on routing or what fan-out issues might there be).

Basically: determine end of instruction at each byte (trivial but expensive). Determine end of two instructions at each byte via end2[i]=end[end[i]]. Then end4[i]=end2[end2[i]], etc, log times.

That's essentially log(n) shuffles. With 32-byte/cycle decode that's roughly five 'vpermb ymm's, which is rather expensive (though various forms of shortcuts should exist - for the larger layers direct chasing is probably feasible, and for the smaller ones some special-casing of single-byte instructions could work).

And, actually, given the mention of O(log n)-transistor shuffles at http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo..., it might even just be O(n * log^2(n)) transistors.

Importantly, x86 itself plays no part in the non-trivial part. It applies equally to the RISC-V compressed extension, just with a smaller constant.
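The doubling scheme described above (end2[i]=end[end[i]], then end4, and so on) can be sketched in software. This is only an illustration of the data flow, not a model of real decoder hardware: `toy_length` is a hypothetical length rule for a made-up ISA, and the function names are invented for this example.

```python
def toy_length(byte: int) -> int:
    # Hypothetical toy ISA: the low two bits of the first byte give a length of 1..4.
    return (byte & 0b11) + 1


def build_jump_tables(code: bytes, levels: int) -> list[list[int]]:
    """tables[k][i] = offset reached after skipping 2**k instructions from byte i."""
    n = len(code)
    # Speculatively compute the end of an instruction at EVERY byte offset
    # (the "trivial but expensive" step). Offset n acts as a sentinel.
    end = [min(i + toy_length(code[i]), n) for i in range(n)] + [n]
    tables = [end]
    for _ in range(levels - 1):
        prev = tables[-1]
        # Pointer doubling: one gather (shuffle) of the table through itself.
        tables.append([prev[prev[i]] for i in range(n + 1)])
    return tables


def instruction_starts(code: bytes, count: int) -> list[int]:
    """Start offsets of the first `count` instructions, beginning at offset 0."""
    levels = max(1, count.bit_length())
    tables = build_jump_tables(code, levels)
    starts = []
    for k in range(count):
        pos = 0
        # Compose the jumps selected by the binary digits of k.
        for level in range(levels):
            if (k >> level) & 1:
                pos = tables[level][pos]
        starts.append(pos)
    return starts


if __name__ == "__main__":
    code = bytes([0x01, 0xAA, 0x00, 0x03, 0, 0, 0, 0x02, 0, 0])
    print(instruction_starts(code, 4))
```

Each extra table level is one more gather over the whole fetch window, which is where the log factor in both area and critical path comes from. Nothing in the doubling step depends on how `toy_length` is computed, matching the point that the ISA only affects the constant.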

janwas 676 days ago

> With 50+ years of figuring the basics out, RISC-V won't be making any major mistakes on the most important stuff.

RVV does have significant departures from prior work, and some of them are difficult to understand:

- the whole concept of avl, which adds complexity in many areas including reg renaming. From where I sit, we could just use masks instead.

- mask bits reside in the lower bits of a vector, so we either require tons of lane-crossing wires or some kind of caching.

- global state LMUL/SEW makes things hard for compilers and OoO.

- LMUL is cool but I imagine it's not fun to implement reductions, and vrgather.

dzaima 676 days ago

How does avl affect register renaming? (there's the edge-case of vl=0 that is horrifically stupid (which is by itself a mistake for which I have seen no justification but whatever) but that's probably not what you're thinking of?) Agnostic mode makes it pretty simple for hardware to do whatever it wants.

Over masks it has the benefit of allowing simple hardware short-circuiting, though I'd imagine it'd be cheap enough to 'or' together mask bit groups to short-circuit on (and would also have the benefit of better masked throughput)

Cray-1 (1976) had VL, though, granted, that's a pretty long span of no-VL until RVV.

clamchowder 676 days ago

Some notes: 1. Consider whether M1's 8-wide decoder could hit the 5+ GHz clock speeds that Intel Golden Cove's decoder can. More complex logic with more delays is harder to clock up. Of course M1 may be held back by another critical path, but it's interesting that no one has managed to get an 8-wide Arm decoder running at the clock speeds that Zen 3/4 and Golden Cove can.

A715's slides say the L1 icache gains uop cache features including caching fusion cases. Likely it's a predecode scheme much like AMD K10, just more aggressive with what's in the predecode stage. Arm has been doing predecode (moving some stages to the L1i fill path rather than the hotter L1i hit path) to mitigate decode costs for a long time. Mitigating decode costs again with a uop cache never made much sense especially considering their low clock speeds. Picking one solution or the other is a good move, as Intel/AMD have done. Arm picked predecode for A715.

2. The paper does not say 22% of core power is in the decoders. It does say core power is ~22% of package power. Wrong figure? Also, can you determine if the decoder power situation is different on Arm cores? I haven't seen any studies on that.

3. Splitting decode into multiple blocks doesn't penalize throughput once the load balancing is done right, which Gracemont did. And you have to massively unroll a loop to screw up Tremont anyway. Conversely, decode blocks may lose less throughput with branchy code. Consider that decode slots after a taken branch are wasted, and clustered decode gets around that. Intel stated they preferred 3x3 over 2x4 for that reason.

4. "uops used by ARM are extremely close to the original instructions" It's the same on x86, micro-op count is nearly equal to instruction count. It's helpful to gather data to substantiate your conclusions. For example, on Zen 4 and libx264 video encoding, there's ~4.7% more micro-ops than instructions. Neoverse V2 retires ~19.3% more micro-ops than instructions in the same workload. Ofc it varies by workload. It's even possible to get negative micro-op expansion on both architectures if you hit branch fusion cases enough.

8. You also have to tell your ARM compiler which of the dozen or so ISA extension levels you want to target (see https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#inde...). It's not one option by any means. Not sure what you mean by "peephole heuristic optimizations", but people certainly micro-optimize for both arm and x86. For arm, see https://github.com/dotnet/runtime/pull/106191/files as an example. Of course optimizations will vary for different ISAs and microarchitectures. x86 is more widely used in performance critical applications and so there's been more research on optimizing for x86 architectures, but that doesn't mean Arm's cores won't benefit from similar optimization attention should they be pressed into a performance critical role.

neonsunset 676 days ago

> Not sure what you mean by "peephole heuristic optimizations"

Post-emit or within-emit stage optimization where a sequence of instructions is replaced with a more efficient shorter variant.

Think replacing pairs of ldr and str with ldp and stp, changing ldr and increment with ldr with post-index addressing mode, replacing address calculation before atomic load with atomic load with addressing mode (I think it was in ARMv8.3-a?).

The "heuristic" here might be possibly related to additional analysis when doing such optimizations.

For example, the previously mentioned ldr, ldr -> ldp (or stp) optimization is not always a win. During work on .NET 9, there was a change[0] that improved load and store reordering to make it more likely that simple consecutive loads and stores are merged on ARM64. However, this change caused regressions in various hot paths because, for example, a previously matched ldr w0, [addr], ldr w1, [addr+4] -> modify w0 -> str w0, [addr] sequence got replaced with ldp w0, w1, [addr] -> modify w0 -> str w0, [addr].

Turns out this kind of merging defeated store forwarding on Firestorm (and newer) as well as other ARM cores. The regression was subsequently fixed[1], but I think the parent comment author may have had scenarios like these in mind.

[0]: https://github.com/dotnet/runtime/pull/92768

[1]: https://github.com/dotnet/runtime/pull/105695
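As a rough sketch of what such a peephole pass with a heuristic guard might look like, here is a toy version over a made-up instruction list. The `Insn` type, the `merge_loads` name, the 4-byte register width, and the lookahead `window` are all assumptions for illustration; this is not the logic used by the .NET JIT or by the linked pull requests.

```python
from typing import NamedTuple


class Insn(NamedTuple):
    op: str          # "ldr", "str", "ldp", or any other mnemonic
    regs: tuple      # registers written or read by the instruction
    base: str = ""   # base address register for memory operations
    offset: int = 0  # byte offset from the base register


def merge_loads(insns: list[Insn], window: int = 4) -> list[Insn]:
    """Merge adjacent 4-byte ldr pairs into ldp unless a nearby narrow store
    to the same bytes suggests the wide load could miss store forwarding."""
    out = []
    i = 0
    while i < len(insns):
        a = insns[i]
        b = insns[i + 1] if i + 1 < len(insns) else None
        mergeable = (
            b is not None
            and a.op == "ldr" and b.op == "ldr"
            and a.base == b.base
            and b.offset == a.offset + 4   # consecutive 4-byte slots
            and a.regs[0] != b.regs[0]     # ldp needs two distinct destinations
            and a.regs[0] != a.base        # first load must not clobber the base
        )
        if mergeable:
            nearby = insns[i + 2 : i + 2 + window]
            # Heuristic guard (assumed): a 4-byte store into the 8-byte range
            # hints at a load/modify/store loop where merging may hurt.
            narrow_store = any(
                n.op == "str" and n.base == a.base
                and n.offset in (a.offset, b.offset)
                for n in nearby
            )
            if not narrow_store:
                out.append(Insn("ldp", (a.regs[0], b.regs[0]), a.base, a.offset))
                i += 2
                continue
        out.append(a)
        i += 1
    return out


if __name__ == "__main__":
    hot = [
        Insn("ldr", ("w0",), "x2", 0),
        Insn("ldr", ("w1",), "x2", 4),
        Insn("add", ("w0", "w0")),
        Insn("str", ("w0",), "x2", 0),
    ]
    print(merge_loads(hot))
```

The pattern match itself is mechanical. The "heuristic" part is the guard, which has to guess from local context whether the shorter sequence is actually faster on the target core.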

hajile 676 days ago

1. Why would you WANT to hit 5+GHz when the downsides of exponential power take over? High clocks aren't a feature -- they are a cope.

AMD/Intel maintain I-cache and maintain a uop cache kept in sync. Using a tiny part to pre-decode is different from a massive uop cache working as far in advance as possible in the hopes that your loops will keep you busy enough that your tiny 4-wide decoder doesn't become overwhelmed.

2. The float workload was always BS because you can't run nothing but floats. The integer workload had 22.1w total core power and 4.8w power for the decoder. 4.8/22.1 is 21.7%. Even the 1.8w float case is 8% of total core power. The only other argument would be that the study is wrong and 4.8w isn't actually just decoder power.

3. We're talking about worst cases here. Nothing stops ARM cores from creating a "work pool" of upcoming branches in priority order for them to decode if they run out of stuff on the main branch. This is the best of both worlds where you can be faster on the main branch AND still do the same branchy code trick too.

4. This is the tail wagging the dog (and something else if your numbers are correct). Complex x86 instructions have garbage performance, so they are avoided by the compiler. The problem is that you can't GUARANTEE those instructions will NEVER be used, so the mere specter of them forces complex algorithms all over the place where ARM can do more simple things.

In any case, your numbers raise a VERY interesting question about x86 being RISC under the hood.

Consider this. Say that we have 1024 bytes of ARM code (256 instructions). x86 is around 15% smaller (about 870 bytes) and with the longer 4.25 byte instruction average, x86 should have around 205 instructions. If ARM is generating 19.3% more uops than instructions, we have about 305 uops. x86 with just 4.7% more has 215 uops (the difference is way outside any margin of error).

If both are doing the same work, x86 uops must be in the range of 30% more complex. Given the limits of what an ALU can accomplish, we can say with certainty that x86 uops are doing SOMETHING that isn't the RISC they claim to be doing. Perhaps one could claim that x86 is doing some more sophisticated instructions in hardware, but that's a claim that would need to be substantiated (I don't know what ISA instructions you have that give a 15% advantage being done in hardware, but aren't already in the ARM ISA and I don't see ARM refusing to add circuitry for current instructions to the ALU if it could reduce uops by 15% either).
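The back-of-the-envelope numbers above can be reproduced with a few lines. The inputs are the figures quoted in this thread (15% smaller x86 code, a 4.25-byte average x86 instruction, and the 19.3% and 4.7% micro-op expansion rates), taken as given rather than verified; the function name `uop_estimate` is made up for the example.

```python
def uop_estimate(arm_bytes: float = 1024.0) -> dict:
    """Recompute the uop counts from the thread's assumed inputs."""
    arm_insns = arm_bytes / 4              # fixed 4-byte AArch64 instructions
    x86_bytes = arm_bytes * (1 - 0.15)     # assumed: x86 code is ~15% smaller
    x86_insns = x86_bytes / 4.25           # assumed: 4.25-byte average length
    arm_uops = arm_insns * 1.193           # Neoverse V2 expansion quoted above
    x86_uops = x86_insns * 1.047           # Zen 4 expansion quoted above
    return {
        "arm_insns": arm_insns,
        "x86_insns": x86_insns,
        "arm_uops": arm_uops,
        "x86_uops": x86_uops,
        "arm_over_x86": arm_uops / x86_uops,
    }


if __name__ == "__main__":
    for name, value in uop_estimate().items():
        print(f"{name}: {value:.2f}")
```

With these inputs ARM ends up with roughly 42% more uops than x86 for the same code, or equivalently x86 with roughly 30% fewer, so the "30%" figure depends on which side is used as the baseline.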

8. https://en.wikipedia.org/wiki/Peephole_optimization

The final optimization stage is basically heuristic find & replace. There could in theory be a mathematically provable "best instruction selection", but finding it would require trying every possible combination, which isn't feasible as long as P≠NP holds true.

My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though. Instead, you take your basic, short instruction and pad it with useless bytes. Add a couple useless bytes to a bunch of instructions and you now have the right length to push the function over to the cache boundary without adding any NOPs.

But the issues go deeper. When do you use a REX prefix? You may want it so you can use 16 registers, but it also increases code size. REX2 with APX is going to increase this issue further where you must juggle when to use 8, 16, or 32 registers and when you should prefer the long REX2 because it has 3-register instructions. All kinds of weird tradeoffs exist throughout the system. Because the compilers optimize for the CPU and the CPU optimizes for the compiler, you can wind up in very weird places.

In an ISA like ARM, there isn't any code density weirdness to consider. In fact, there's very little weirdness at all. Write it the intuitive way and you're pretty much guaranteed to get good performance. Total time to work on the compiler is a zero-sum game given the limited number of experts. If you have to deal with these kinds of heuristic headaches, there's something else you can't be working on.

anvuong 677 days ago

That was true when ARM was first released, but over the years the decoder for ARM has gotten more and more complicated. Who would have guessed adding more specialized instructions would result in more complicated decoders? ARM now uses multi-stage decoders, just the same as x86.

IshKebab 677 days ago

Sure, but it's not idle power consumption that's the difference between these.

wmf 677 days ago

When a laptop gets 12 hours or more of battery life that's because it's 90% idle.

jeffbee 677 days ago

And while it's important to design a chip that can enter a deep idle state, the thing that differentiates one Windows laptop from the next is how many mistakes the BIOS writers made and whether the platform drivers work correctly. This is also why you cannot really judge the expected battery life under Linux by reading reviews of laptops running Windows.

double0jimb0 676 days ago

I didn’t watch this link, but my Zenbook S 16 only gets remotely close to my M2 MBA battery life if the Zenbook is in whatever Windows 11 calls ‘efficiency’ mode, and then it benchmarks at 50% of the M2.

I don’t think the two are remotely comparable in perf/watt.

cubefox 677 days ago

Unlike AMD and Qualcomm, Apple uses an expensive TSMC 3nm process, so you would expect better battery life from the "MBP3". I assume they used the process improvements to increase performance instead.

hajile 676 days ago

Perf per watt is higher for M1 on N5 vs Zen5 on N4P, so the problems go deeper than just process.

X Elite also beats AMD/Intel in perf/watt while being on the same N4P node as HX370.

https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...

cubefox 675 days ago

Performance per watt also depends on clock speed. Other things equal, higher clock speed means worse performance per watt.

GrumpyYoungMan 677 days ago

The display, RAM, and other peripherals are consuming power too. Short of running continuous high CPU loads, which most people don't do on laptops, changes in CPU efficiency have less apparent effect on battery life because it's only a fraction of overall power draw.

acchow 675 days ago

> within an hour of MBP3

Not a good way to measure. The Zenbook S16 has a larger 78Wh battery vs the MacBook Pro’s 69.6Wh.

So that’s 11% less battery life despite 12% more battery capacity.
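Normalizing by capacity makes the gap easier to see. A small sketch, using the two capacities quoted above and taking the "11% less battery life" figure as given (the function name is made up for illustration):

```python
def runtime_per_wh_ratio(runtime_ratio: float,
                         capacity_a_wh: float,
                         capacity_b_wh: float) -> float:
    """Runtime per watt-hour of laptop A relative to laptop B."""
    return runtime_ratio / (capacity_a_wh / capacity_b_wh)


if __name__ == "__main__":
    zenbook_wh, mbp_wh = 78.0, 69.6   # capacities quoted in the comment above
    runtime_ratio = 0.89              # assumed: "11% less battery life"
    capacity_ratio = zenbook_wh / mbp_wh
    per_wh = runtime_per_wh_ratio(runtime_ratio, zenbook_wh, mbp_wh)
    print(f"capacity ratio: {capacity_ratio:.3f}")
    print(f"runtime per Wh ratio: {per_wh:.3f}")
    print(f"implied average power ratio: {1 / per_wh:.3f}")
```

Under those assumptions the Zenbook gets roughly 0.79x the runtime per watt-hour, meaning it draws about a quarter more average power in that test.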

halJordan 677 days ago

Yeah if you make a worse core and then downclock it then you will increase power efficiency. AMD thankfully only downclocks the 5c, but Intel is shipping ivy lake equivalents in their flagship products just to get power efficiency up.