Why x86 doesnt need to die (chipsandcheese.com)
78 points by ylk1 815 days ago
12 comments

> x86-64 CPUs keep real mode around so that operating systems can keep booting in the same way ... It’s part of the PC compatibility ecosystem that gives x86 CPUs unmatched compatibility and longevity.

This, imo, is one of the biggest advantages of x86 currently, at least as a hobbyist. Compare that to ARM-based computers (like the Raspberry Pi, for example), where the boot process is different for each device and usually involves proprietary binaries whose workings the user has no insight into.

In comparison, you could re-use, update, and repurpose any old x86 machine to do whatever you need.

The really annoying thing is that we're so close to doing better - openfirmware is decades old, and if we must throw that away UEFI is in fact portable; we could have UEFI ARM machines with nice normal busses that the OS can enumerate and boot just like x86. But, y'know, that would cost another 10 cents a board so we get to live with the current trash. (I mean, this is even a thing that we do use to boot VMs and Windows on ARM, and AIUI ex. https://libre.computer/ does use UEFI firmware, the adoption is just super limited)

Opened comments to make sure this was mentioned: X86S (formerly X86-S or X86-Simplified). Getting rid of all the old compat modes & booting straight to 64-bit. See: Intel Continues Prepping The Linux Kernel For X86S, https://www.phoronix.com/news/Linux-6.9-More-X86S . Also mentioned by chipsandcheese: Of course, compatibility can’t be maintained forever. ISAs have to evolve. AMD and Intel probably want to save some money by reducing the validation work needed to support real mode. Intel is already planning to drop real mode.

Personally I think ditching the old real mode systems will be a big boon to hobbyists, not a hindrance (sorry mode 13h users)! Linux/x86 Boot Protocol docs tentatively support this assertion (https://www.kernel.org/doc/html/v6.9-rc1/arch/x86/boot.html). What is helpful is having ACPI and UEFI and other conventions/standards in place.

ARM does have Base Boot Requirements (https://arm-software.github.io/ebbr/), which builds to something vaguely x86-like, but wow, there are so many systems that still use hardcoded device trees. I haven't spent that much time, but just a couple hours is all it takes to figure out that uncompressing your dtb, applying an overlay, and recompiling a new dtb is awful & terrible & no way to compute. ARM is used so heavily in consumer devices that it's hard to see what would compel the greater ecosystem to do the right thing, to reform. I also can go read a deck like https://uefi.org/sites/default/files/resources/UEFI%20and%20... (UEFI and ACPI in Arm System Architecture) and appreciate, yeah, well, trying to be compatible & a good citizen is hard; there are specifications on top of specifications on top of specifications (Wei lists 17) to make it happen. x86 has benefited from a history of everyone tending towards intercompatibility, but there's nothing else in computing that's ever had such a strong overriding cooperation motive before.

Hardcoded DTBs are still prevalent because ACPI is a fudging mess that has no business being codified further in any standard. Anyone involved should focus their effort on a modern replacement; maybe something WASM-based, because at least that would have a chance of including a well-delimited API and runtime.

That's the best part of x86-64. From a laptop to a large number of enterprise servers (or at least any I use) you're essentially dealing with the same stuff. I think Pi's were the first non x86-64 architecture I'd dealt with to any degree.

OK- there were some SPARC servers but that was a while ago - and they were honestly never any fun.

I don't think the instruction encoding is a significant problem. Cache coherency really might be.

A current x64 chip is a dozen or so separate dies with eight or so x64 cores per die, with a couple of those in different sockets. When one thread on one core decides to write to a cache line, the memory model makes really strong guarantees about cores on some other socket noticing that change.

Arm doesn't have to go with total store order. GPUs involve distinct blocks of memory with their own invariants on when caches are invalidated at potentially very coarse granularity (like no change will be seen until after a kernel has finished executing, where a kernel is essentially a process that sprung to life and then did arbitrary amounts of maths).

Fast x64 code is prone to carefully partitioning the problem across different cores and trying not to hit a cache from another core but even then you still have something like MOESI sitting in the background waiting just in case some thread mutates the instructions executing on another one.

This misses an important bit: parallel decoding of instructions. It is a lot harder with variable-length instrs where the length cannot even be calculated from the first byte - you need to read 10 bytes in the worst case to find an instr's len in x86. In aarch64 you need to read 0 bytes to know the length - it is 4

This matters in the way it interacts with i-cache. In aarch64 with 64-byte cache lines, one cache line is 16 instrs. always. In x86 that cache line could contain only 3 whole instrs. So unless your core is able to ingest over one icache line per cycle (intel cores currently are NOT), you are thus limited.

https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...

>Another oft-repeated truism is that x86 has a significant ‘decode tax’ handicap. ARM uses fixed length instructions, while x86’s instructions vary in length. Because you have to determine the length of one instruction before knowing where the next begins, decoding x86 instructions in parallel is more difficult. This is a disadvantage for x86, yet it doesn’t really matter for high performance CPUs because in Jim Keller’s words:

>For a while we thought variable-length instructions were really hard to decode. But we keep figuring out how to do that. … So fixed-length instructions seem really nice when you’re building little baby computers, but if you’re building a really big computer, to predict or to figure out where all the instructions are, it isn’t dominating the die. So it doesn’t matter that much.

>...

>Researchers agree too. In 2016, a study supported by the Helsinki Institute of Physics[2] looked at Intel’s Haswell microarchitecture. There, Hirki et al. estimated that Haswell’s decoder consumed 3-10% of package power. The study concluded that “the x86-64 instruction set is not a major hindrance in producing an energy-efficient processor architecture.”

I did not talk about power - I talked about perf. No modern x86 chip can decode 6 or 7 of these long instrs per cycle. There are aarch64 chips that can.

Perhaps it's compensated by the fact a single x86 instruction does more? If a bunch of those aarch64 instructions would be loads and stores, but for x86 they're part of the arithmetic instructions, then it maybe doesn't matter?

What impact does it have on the overall performance though? Keller's argument is that the effect is small/negligible.

Keller's argument (as stated) is that it doesn't take up much die space. Hirki's argument is that it doesn't consume much power. Neither addresses dmitrygr's argument, which is about performance and bottlenecks. (It could use very little power and very little space and still be a very big bottleneck.)

That doesn't mean that dmitrygr is correct. It means that everyone trying to answer him is arguing about the wrong thing.

The main issue with that argument is that the L1i cache can never realistically be exhausted fast enough to form a bottleneck, as long as the decoder is working ahead of the start of the execution pipeline.

The hard limit on instruction size is 15 bytes, so a 64-byte cache line will always be able to store at least 4 of them. (Or 3 plus the tail of an instruction from a previous line.) Meanwhile, on the other end, Intel cores can only retire up to 4 μops per cycle. Since each instruction takes at least 1 μop (except for macro-fusion, which only works on short instructions), retirement will always form a bottleneck before decoding can.
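
The arithmetic in this sub-thread is easy to check. A minimal sketch, assuming 64-byte lines, 4-byte aarch64 instructions and the 15-byte x86 limit mentioned above; the helper name is made up for illustration, and real code almost never consists of maximum-length instructions back to back:

```python
LINE_BYTES = 64          # cache line size assumed in the thread
AARCH64_INSN_BYTES = 4   # fixed-length encoding
X86_MAX_INSN_BYTES = 15  # architectural upper bound on x86 instruction length


def whole_insns_in_line(insn_bytes: int, spill_in: int = 0) -> int:
    """Count back-to-back instructions of `insn_bytes` that fit entirely in one line.

    `spill_in` is the number of leading bytes of the line occupied by the tail
    of an instruction that started in the previous line.
    """
    return (LINE_BYTES - spill_in) // insn_bytes


aarch64 = whole_insns_in_line(AARCH64_INSN_BYTES)
x86_aligned = whole_insns_in_line(X86_MAX_INSN_BYTES)
# Worst case over every possible spill-in from the previous line (0..14 bytes).
x86_worst = min(
    whole_insns_in_line(X86_MAX_INSN_BYTES, s) for s in range(X86_MAX_INSN_BYTES)
)
print(aarch64, x86_aligned, x86_worst)
```

Under these assumptions the numbers come out as 16, 4 and 3, which matches both the "only 3 whole instrs" claim and the "at least 4, or 3 plus a tail" correction.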

And in realistic code where you'd actually see these long instructions, i.e., hot SIMD loops, all the decoded instructions would stay warm and toasty in the μop cache (allegedly holding 6 fixed-size μops per cache line) after the first iteration.

> It could use very little power and very little space and still be a very big bottleneck.

I believe in chip design, this doesn't really happen (often). You can optimize the bottlenecks by allocating them more space and power.

I interpret Keller's statement indirectly - the fact that modern x86 CPUs dedicate only a small part of their circuitry to decoding logic means that it's not a bottleneck (otherwise there would be more circuitry for it).

The total architectural difference is pretty small in general. Like, say switching a chip from Intel to ARM lets you make it 30% faster. For the last several decades that was insignificant. Not so much these days though.

The decode difficulty may make a 5% difference, but add in the other things people have mentioned and maybe it adds up to 30%. (numbers pulled out of my arse)

Do you have benchmarks showing this? People would switch to ARM if this is true. Note Linux and Windows run just fine on ARM.

I believe that modern x86 processors store decoded micro-ops in the I$ instruction cache.

I always understood micro-ops to be fixed length.

Sure, you have to decode the variable length instructions at some point. But that extra work, relative to aarch64, is in practice amortized over the lifetime of that cache line.

That's not how it works for either Intel or AMD's current designs. Both use an L1 I$ which consists of encoded instructions, while adding a small uop cache (sometimes called L0) for recently decoded instructions.[1][2]

Intel's Netburst architecture stored decoded instruction sequences in its L1 cache, which Intel called a trace cache.[3] This didn't work out too well, so Intel reverted to a conventional L1 cache with the successor Merom[4] and introduced the uop cache shortly after with Sandy Bridge[5].

[1]https://chipsandcheese.com/2021/12/02/popping-the-hood-on-go...

[2]https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-fron...

[3]https://chipsandcheese.com/2022/06/17/intels-netburst-failur...

[4]https://chipsandcheese.com/2023/02/05/intels-dunnington-core...

[5]https://chipsandcheese.com/2023/08/04/sandy-bridge-setting-i...

> So unless your core is able to ingest over one icache line per cycle (intel cores currently are NOT), you are thus limited.

Do Intel cores no longer have a μop cache in front of the L1i cache?

>It is a lot harder with variable-length instrs where the length cannot even be calculated from the first byte - you need to read 10 bytes in the worst case to find an instr's len in x86. In aarch64 you need to read 0 bytes to know the length - it is 4

x86's approach to variable-length instructions is unfortunate.

In contrast, RISC-V leverages variable-length encoding to get the best code density among 64bit ISAs while sidestepping the instruction boundary problem.

(I digress, but note that while for the 32bit ISA RISC-V code density was competitive yet bested by ARM thumb2, it has since improved; RISC-V has the best density overall)

The length of a RISC-V instruction is in the first byte though, not the tenth.
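
For reference, a sketch of what "the length is in the first byte" means in practice, assuming only the standard 16-bit (C extension) and 32-bit encodings. The function name is illustrative, and the longer encodings the spec sets aside are deliberately not handled here:

```python
def riscv_insn_length(first_byte: int) -> int:
    """Return the instruction length in bytes, given its first (lowest-address) byte."""
    if first_byte & 0b11 != 0b11:
        return 2  # low two bits != 11: 16-bit compressed instruction
    if first_byte & 0b11100 != 0b11100:
        return 4  # low two bits == 11 and bits [4:2] != 111: 32-bit instruction
    # Assumption: anything else is a longer/reserved encoding, out of scope here.
    raise ValueError("encoding longer than 32 bits is not handled in this sketch")


print(riscv_insn_length(0x01))  # first byte of c.nop (0x0001)
print(riscv_insn_length(0x13))  # first byte of addi x0, x0, 0 (0x00000013)
```

A decoder can therefore find every instruction boundary in a fetch block by looking at a few bits per 16-bit parcel, rather than parsing prefixes and opcode bytes first.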

Note that RISC-V's code density with the C extension is in bytes, not in number of instructions. The core integer ISA was designed to be extensible from small embedded MCUs, so every other chip has to use it. High-performance RISC-V cores depend a lot on macro-op fusion to run as fast as 64-bit ARM.

>not in number of instructions.

This comes up very often, but is an unfounded concern. Not only is instruction count competitively low, but as it turns out, critical paths of inter-dependent instructions are, at worst (w/o fusion nor 2019+ extensions), no worse than aarch64[0].

>The core integer ISA was designed to be extensible from small embedded MCUs, so every other chip has to use it.

There's so much to unpack here. Firstly, the ISA, as documented in the specification itself[1], is described as "An ISA separated into a small base integer ISA, usable by itself as a base for customized accelerators or for educational purposes, and optional standard extensions, to support general-purpose software development." Note there's no reference to small embedded MCUs in there.

Furthermore, the spec elaborates "An ISA that avoids “over-architecting” for a particular microarchitecture style (e.g., microcoded, in-order, decoupled, out-of-order) or implementation technology (e.g., full-custom ASIC, FPGA), but which allows efficient implementation in any of these."

>High-performance RISC-V cores depend a lot on macro-op fusion to run as fast as 64-bit ARM.

First news. There seems to be some confusion here. 64-bit ARM (aarch64) is implemented in a range of microarchitectures, targeting different uses. I will go ahead and assume (for convenience) that you meant specifically very high performance implementations, as used in workstations and servers.

These tend to be superscalar and very wide (Apple M1 and Tenstorrent Ascalon are 8-wide). Their execution units tend to be simpler, and instead there's more of them and some can only do specific tasks. Typically, for these macro-op fuse-able instructions, an ARM microarchitecture will have to emit multiple micro-ops, whereas in RISC-V they already come as separate instructions.

0. https://dl.acm.org/doi/pdf/10.1145/3624062.3624233

1. https://riscv.org/technical/specifications/ (unprivileged architecture)

I think you are missing the only point of the article: performance and compatibility are important; everything else is just aesthetics.

As long as Intel can produce fast CPUs, with new features and while maintaining support for the existing binaries, everything is OK. Fixed or variable length, that's a matter for Intel engineers: users could, and should, care less.

Most important applications have an ARM version now. Especially true since Apple Silicon and AWS Graviton. Windows will force developers to compile both x86 and ARM versions.

It's a nice theory but I don't think it holds up. X64 executes from a micro op cache and there's no particular reason to expect the ops in that to be variable length encoded. Thus it only goes to the i-cache when that misses, at which point you've spent long enough digging around in the cache that the extra decoding probably doesn't matter.

It's kind of like saying x64 is limited by only having 16 registers - there's only names for 16ish in the ISA, but there's loads more registers in the machine as part of hiding latency.
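
To make the "more registers in the machine than names in the ISA" point concrete, here is a toy register-renaming sketch. The class, register counts and method names are invented for illustration; real renamers also reclaim physical registers at retirement, which this omits:

```python
class ToyRenamer:
    """Maps a small set of architectural register names onto a larger physical file."""

    def __init__(self, arch_regs, num_physical):
        if num_physical < len(arch_regs):
            raise ValueError("need at least one physical register per name")
        # Initially each architectural name owns one physical register.
        self.table = {name: idx for idx, name in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dst, srcs):
        """Return (physical dst, physical srcs) for one instruction."""
        phys_srcs = [self.table[s] for s in srcs]  # read current mappings first
        phys_dst = self.free.pop(0)                # every write gets a fresh register
        self.table[dst] = phys_dst
        return phys_dst, phys_srcs


renamer = ToyRenamer(["rax", "rbx"], num_physical=8)
first = renamer.rename("rax", ["rbx"])   # rax = f(rbx)
second = renamer.rename("rax", ["rbx"])  # rax = g(rbx), independent of the first write
print(first, second)
```

Both instructions name `rax` as their destination, yet they land in different physical registers, so the second write does not have to wait for the first. The ISA-visible names only bound how many values software can refer to at once, not how many the core can keep in flight.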

>It's kind of like saying x64 is limited by only having 16 registers - there's only names for 16ish in the ISA, but there's loads more registers in the machine as part of hiding latency.

Why not have just one, then?

After all, there's loads more registers in the machine as part of hiding latency.

The ISA either matters or it does not. Pick one.

usually the really fat instructions take over 1 cycle anyway, right? so the decoder should be able to keep up

pipelining...

they are usually pipelineable

Those pipelines come with an area cost, a power cost and a latency cost.

Ignoring the power-guzzling data centres running Xeons at work, and talking as a layperson using a laptop, my older Intel MacBook Pro gives me 2-3 hours of battery life and heats up like a toaster, while my M2 MacBook Pro runs cool, and lasts a couple of days under moderate to heavy use before going flat. That's a huge win for me.

days of heavy use??? That's really impressive!

I haven't paid any attention to new PC hardware other than the RasPi in the last 5 years or so, and I've always ignored Apple, so I was really not expecting that much progress!

Moderate to heavy use. Not 100% heavy use. The official specs claim 15-18 hours battery life. Practically, I'd say I do about 8 hours a day and the Mac lasts a couple of days before the battery goes to single-digit percentages. With lighter use, I see nearly a week of battery life.

[Mine is an M2 16-inch MBP from last year, perhaps the M3's are somewhat better?]

In comparison, my new Dell work laptop with an Intel chip gives me about 4 hours. It's not an apples <-> apples comparison but they're in the ballpark.

Same here. Disconnected the charger when moving stuff in the office, worked a full day and halfway through the next one the laptop started complaining.

It's impressive. Nothing on the market comes even close.

x86 doesn't "need to die". long-lasting designs that have earned their keep and proven themselves are valuable. so from that standpoint, i agree with the premise.

the problem is that the article relies too much on bad arguments such as "ISA differences were swept aside by the resources a company could put behind designing a chip". this is a dangerously bad argument. the fact that a company can afford to and is willing to keep x86 afloat and competitive with massive resources is not an argument for dismissal, but for its economic usefulness.

the value to a legacy ISA is real. but the cost is complexity. this complexity drives $, silicon area, power. period. in an even drag race, the simpler more efficient design would definitely win.

and a load-store architecture is simpler than a complex addressing scheme and will have more throughput per clock with less resources (design work, validation, $, area, power). a fixed or simple variable length opcode is always going to be simpler to implement than x86!

but on the other hand, a lot of those massive costs (NRE) are sunk costs. others are not.

so there's nothing wrong with x86 at the moment - it's still clearly the cheapest (due to scale) and fastest (definitely per/$) and excepting M2/M3 also fastest absolute per-core.

it certainly doesn't need to "die".

part of the advantage is inertia. and that's part of the disadvantage too. it's just barely starting to look like the trajectory is changing. but by the time the overall economics of ARM, RISC-V, etc. begin to overtake x86, the inertia and cruft will be negatively affecting them too.

things get old and die on their own. there's no stopping it. but in this case it doesn't need to be hastened and it won't be a "happy" moment when the changing of the guard does happen.

The nice thing about x86 is that it's so standardized. Everyone has been copying each other since the IBM PC era.

For a long time everything else was stuck with custom built images for every device.

> For a long time everything else was stuck with custom built images for every device.

Like today with ARM boards, where each OS is custom-bent to each board.

I gather many of those are mainlined now. Bootloaders might still be needed, but I would imagine the bulk of the customisations for a couple "major" platforms is sorted out.

I would not call it sorted out, more like worked around. The problem will grow larger and larger until it becomes unmanageable.

Long-term, correct and fast emulation of x86 and ARM on other platforms is going to be damn important, features and bugs especially, for investment utilization and for long-term archival and historical purposes.

Apple did an excellent job with Rosetta 2 in most cases. It has its limitations since it's not 100% or sufficiently general as to replicate a Windows PC.

One approach that didn't work so well was Transmeta with VLIW and pouring resource-costly optimizations into the compiler.

All-in-all, CISC/RISC debate is a mirage because it depends on the net performance of the macro ISA running on some particular micro ISA. We don't have single-cycle, non-pipelined RISC or low cycle efficiency, hyper deep, 6 GHz PC processors anymore for good reason... they've been supplanted by a series of incremental computer organizational design approaches due to healthy competition. Now we have low energy ARM, blinding-fast laptops from the top 3 vendors, and ridiculous server metal like the 9754 and the 9474F for 6 TiB 2P systems.

>All-in-all, CISC/RISC debate is a mirage because it depends on the net performance of

This ignores that performance is not everything.

Complexity doesn't just mean more effort; it also translates into an increase of bugs.

The problem is not the ISA — it’s the whole ecosystem.

How many CPUID flags exist? There are so many interdependencies, it’s even hard to say what even makes sense, without detailed knowledge. SSE without MMX? The reuse of floating point HW for other stuff is also a mess.

An x86 system is a witches’ brew of MSRs, I/O ports, and chipset-specific PCI devices. And that’s just across Intel CPUs…

How much code has to execute before even a bootloader can run?

Why do we need a damn ACPI interpreter?

Why do we still deal with legacy PCI routing (on all devices) when none actually use it?

The PCI configuration space is a bit of a mess. We should just make a new standard where everything is 64-bit and memory-mapped only.

Why are we shackled with slow IO port operations that replicate hardware from the era when leaded gasoline was widely available? Some may say this is legacy, yet we still rely on it today!!

Imagine if 1980s systems were shackled with 1940s-1960s compatibility concerns.

We need to start afresh — take all the learnings from the past decades and cast off the legacy crap.

Most of these gripes exist in all modern ecosystems as a consequence of heterogeneous and expansive markets of computer hardware. SoC-land is much, much worse. As an exercise, I'd recommend you try porting 9front to a random unsupported SBC. Personally, after a month of reverse engineering and running into really fun and cool hardware bugs, I gave up.

> Why do we need a damn ACPI interpreter?

Because ACPI is still used and for good reason. The lack of an equivalent in virtually every other ISA ecosystem is enough to laugh them out of the room whenever anyone suggests they're a viable alternative.

No thanks. Keep your ridiculous blackbox that's actively hostile to having new software run on it.

> SoC-land is much, much worse

Exactly. It does not matter what core architecture is being used. What matters is that each system usually has a different memory model, which completely defeats any compatibility.

We will when you are willing to pay for the massive cost of redoing everything which already works well. Also, are you going to buy everyone new hardware for your new better architecture?

Note that we would probably end up with something which has just as many problems or maybe even worse. Rewrites are very hard and you really need to know what you are doing to get everything right or even to make things better.

> Why are we shackled with slow IO port operations that replicate hardware from the era when leaded gasoline was widely available? Some may say this is legacy, yet we still rely on it today!!

Does any modern x86 wake up knowing it's a modern x86, or do they all still wake up thinking they are 8086s and progressively wake up from a series of nested nightmares until they realize the truth: that they have more registers, 64 bits and SIMD instructions?

Yesterday I was looking at a server board and it had the two-digit POST display. I assume it is updated with those ancient OUT instructions and attached to the last vestiges of an 8-bit ISA PC bus running at 5v TTL levels.

Today they still boot in 16-bit real mode. Very quickly, they usually switch to 32-bit mode for the BIOS, maybe eventually 64-bit mode.

Option ROMs still start in real mode (at least for non-UEFI). The system management (SMI) handlers are still launched in real mode too!

When the OS gets launched, it brings up the other CPUs in real mode (INIT/SIPI), and has to do all the same gymnastics again…

Even PCI configuration space access is still supported with IO ports, and that mode is still required for some aspects.

Even today, the serial port still uses IO ports and interrupts just as it did 40+ years ago…

Yep, IO based POST codes are still a thing…

One of the reasons why the RISC vs CISC debate keeps popping up every few years, is that we kind of stopped naming CPU uarches after the CISC and RISC terminology was introduced in the late 70s.

And because there weren't any new names, everyone got stuck in this never-ending RISC vs CISC debate.

As ChipsAndCheese points out, the uArch diagrams of modern high-performance ARM and x86 cores look very similar. And the real answer is that both designs are neither RISC nor CISC (the fact that one implements a CISC-derived ISA and the other implements a RISC-like ISA is irrelevant to the actual microarchitecture).

So what is this unnamed uarch pattern?

Mitch Alsup (who dwells on the comp.arch newsgroup) calls them GBOoO (Great Big Out-of-Order). And I quite like that name, guess I just need to convince everyone else in the computer software and computer hardware industry to adopt it too.

The GBOoO design pattern focuses on Out-of-Order execution to a somewhat insane degree.

They have massive reorder buffers (or similar structures) which allow hundreds of instructions to be in-flight at once, with complex schedulers tracking dependencies so they can dispatch uops to their execution units as soon as possible. Most designs today can dispatch at least 8 uops per cycle, and I've seen one design capable of reaching peaks of 14 uops dispatched per cycle.

To feed this out-of-order monster, GBOoO designs have complex frontends. Even the smallest GBOoO designs can decode at least three instructions per cycle. Apple's latest CPUs in the M1/M2 can decode eight instructions per cycle. Alternatively, uop caches are common (especially on x86 designs, but some ARM cores have them too), bypassing any instruction decoding bottlenecks.

GBOoO designs are strongly reliant on accurate branch predictors. With hundreds of instructions in flight, the frontend is often miles ahead of the finalised instruction pointer. That in-flight instruction window might cross hundreds of loop iterations, or cover a dozen function calls/returns. Not only do these branch predictors reach high levels of accuracy (usually well above 90%) and track and predict complex patterns and indirect patterns, but they can actually predict multiple branches per cycle (for really short loops).

Why do GBOoO designs aim for such insane levels of Out-of-Order execution? Partly it's about doing more work in parallel. But the primary motivation is memory latency hiding. GBOoO designs want to race forwards and find memory load instructions as soon as possible, so they can be sent off to the complex multi-level cache hierarchy that GBOoO designs are always paired with.

If an in-order uarch ever misses the L1 cache, then the CPU pipeline is guaranteed to stall. Even if an L2 cache exists, it's only going to minimise the length of the stall.

But because GBOoO designs issue memory requests so early, there is a decent chance the L2 cache (or even L3 cache) can service the miss before the execution unit even needed that data (though I really doubt any GBOoO design can completely bridge a last-level cache miss).
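
A back-of-the-envelope model of the latency-hiding argument, under loudly simplified assumptions: every load misses L1, the loads are independent, the in-order core stalls for the full miss on each one, and the out-of-order core can keep `window` misses in flight at once. The function names and numbers are made up for illustration, not measurements of any real core:

```python
import math


def in_order_cycles(n_loads: int, miss_latency: int, work_per_load: int = 1) -> int:
    """Each miss stalls the pipeline for its full latency before work resumes."""
    return n_loads * (miss_latency + work_per_load)


def overlapped_cycles(
    n_loads: int, miss_latency: int, window: int, work_per_load: int = 1
) -> int:
    """Misses are issued in batches of `window`; each batch pays the latency once."""
    batches = math.ceil(n_loads / window)
    return batches * miss_latency + n_loads * work_per_load


print(in_order_cycles(100, 20), overlapped_cycles(100, 20, window=10))
```

With 100 loads and a 20-cycle miss, the stalling model takes 2100 cycles and the overlapped one 300. The exact figures are meaningless; the point is that the cost of a miss gets divided by how many of them the core can have outstanding, which is why these designs keep growing their windows.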

Where did GBOoO come from?

From what I can tell, the early x86 Out-of-order designs (Intel's Pentium Pro/Pentium II, AMD's K6/K7) were the first to stumble on this GBOoO uarch design pattern. Or at least the first mass-market designs. I'm not 100% sure these early examples fully qualify as GBOoO; they only had reorder buffers large enough for a few dozen instructions, and the designers were drawn to the pattern because GBOoO's decoupled frontend and backend allowed them to push through bottlenecks caused by x86's legacy CISC instruction set.

But as the designs evolved (let's just ignore Intel's misadventures with Netburst), the x86 designs of the mid 2000's (like the Core 2 Duo) were clearly GBOoO, and taking full advantage of GBOoO's abilities to hide memory latency. By 2010, we were starting to see ARM cores that were clearly taking notes and switching to GBOoO style designs.

>As ChipsAndCheese points out, the uArch diagrams of modern high-performance ARM and x86 cores look very similar.

So far so good.

>And the real answer is that both designs are neither RISC nor CISC

This is... Not even wrong.

>(the fact that one implements a CISC-derived ISA and the other implements a RISC-like ISA is irrelevant to the actual microarchitecture).

Exactly. CISC and RISC are characteristics of the ISA, not the microarchitecture.

But note (and I can't stress this enough), this does not mean ISA doesn't matter.

The ISA is the interface between software and hardware. A well designed ISA will e.g.:

- Not restrict the actual design of the microarchitecture.

- Not expose microarchitecture artifacts.

- Not force unjustified complexity into the microarchitecture nor the software.

Yes, good points.

It’s worth remembering that the surviving CISC architectures (x86 and z/370) are less CISCy than VAX and 68k were, in terms of number of address operands and complexity of addressing modes. And ARM is not a classically RISCy RISC. Instruction sets seem to have converged on a pragmatic middle ground — except for RISCV :-)

>except for RISCV :-)

Oh, it also is (very) pragmatic. It's just the sort of pragmatism culture that from the outside gets often misunderstood as "purism" :-)

Title is misspelled, should be “doesn't”.

x86 doesn't need to die because there's nothing wrong with it (or at least, many of its issues are either not talked about or misrepresented.) Marketing hype and people who buy into that have established and cemented a narrative where the despotic x86 chains us to higher power draws and unreasonable architecture choices that we could otherwise do away with if only we could adopt the noble ARM magic silicon.

In truth ARM doesn't actually present any real gains (efficiency or otherwise) over x86 in pretty much any space as an inherent consequence of its ISA. The narrative has its origins seeded by way of the principal market that ARM found widespread success in being... microprocessors and extremely low-end processors like those found in handheld gaming devices and eventually phones. Naturally these processors were designed to sip voltage by way of not actually pushing a whole lot of numbers. The market matured and so did the architecture, and we started to see cellphones that could really sling their weight! It's all smoke and mirrors though, as even in TYOOL 2024 the moment you do something intensive that your phone does not have a hardware accelerator for (eg, compiling software) it becomes apparent the thing you're holding in your hand is about as good as a Core 2 Duo when it comes to crunching numbers with a lot of branches.

Then of course Apple came along and brought ARM back to the desktop space after decades of being relegated to power-sippers. People's jaws hit the floor over a chip that doesn't actually perform any better than its AMD counterparts, because oh hey! 20 hour battery life! Well, actually it's 11 hours if you're doing really light web browsing, and only in Safari. Well hey, that's slightly better than the comparable laptop chips in the x86 family released around the same time, right? And indeed, that's a few hours more than I got in my first gen AMD T14 which by all metrics is close enough to the M1 chip in my mbp. But you know, the further I dug, the more I found out that the battery life was about comparable when the system actually has to start doing work. Long video calls in Zoom? That was about 4 hours on both. Heavy use of Firefox? 6 hours each. Lots of compiling and a resource heavy dev environment with a fat C++ language server? Again, about 4 hours on both.

The battery gains weren't from some mystical discrepancy between how instructions are decoded on the two chips. In the end it just came down to the fact that, as with all operating systems, the great M1 battery life was owed to really cute power management drivers in Apple's operating system (as well as implementing a fair amount of extensions to make things like video decode and javascript execution draw less power.) It's a fact that becomes all the more apparent when, while following the Asahi devblog, during the watershed moment of actually getting Linux to bootstrap itself on Apple silicon, I read the kernel's main loop doing absolutely nothing chewed through the whole battery in less than 3 hours. That sounds about right to me, given every single power draw experiment I've read between 2020 and now indicates the M1's power draw isn't really as great as the hype machine has made it out to be.

I'm sure we're all at the edge of our seats hoping to experience an efficiency revolution, and that this can be given to us with this one weird trick of changing ISAs... but it's not happening. At least not with ARM, or RISC-V or any other contemporary architecture that isn't x86.

Far be it from me to defend Intel or their Frankenstein's monster architecture, but I've gotten a bit tired of this dream that we're on the cusp of performing supertasks in an instant, at the cost of mere picowatts. Especially when it inadvertently pushes us towards a future of SoC lock-in where hobbyist bare-metal development becomes an even bigger waking nightmare. Until then, I sincerely hope x86 never dies, even though saying so kills a piece of me inside.

>I'm sure we're all at the edge of our seats hoping to experience an efficiency revolution, and that this can be given to us with this one weird trick of changing ISAs... but it's not happening. At least not with ARM, or RISC-V or any other contemporary architecture that isn't x86.

You seem to miss that RISC-V being better is beside the point.

RISC-V's massive success was unavoidable, due to its open specification and free license.

The ISA did not need to even be good, just decent.

Most CISC proponents (armchair digital architects) entirely miss the point.

An ISA is the interface between hardware and software. Thus a complex ISA does impose complexity upon both the hardware and the software.

Complexity is inherently (very) bad, and thus needs strong justification.

RISC embodies this idea by recognizing the value of simplicity and requiring that any ISA addition be weighed against its complexity cost.

Implementations of RISC philosophy ISAs demonstrate (by achieving or even surpassing parity) the complexity in x86 is not justified, and this is why there hasn't been any tabula rasa CISC architecture worth noting in several decades.

I am going to be blunt. If CISC is so bad, why did almost all of the RISC chips from the 1980s and 1990s fail? Why aren't we using them today? The closest you will get is ARM chips. If you are going to claim RISC is fundamentally better, why aren't the fastest and most power efficient chips RISC chips? Why are Amazon, Google, and Microsoft buying an enormous number of x64 chips? It's not because they love the architecture. It's because x64 chips are the best in terms of cost, power usage, and performance.

My guess is the most important thing for chip performance is the manufacturing process. After that, it's things like pipelining, branch predictors, superscalar design, etc. (I am not an expert and this is just a guess). I don't think the instruction set really matters that much when chips have billions of transistors.

RISC was a great idea in the 1970s because a more complex instruction set meant fewer transistors for performance improvements. The same was also true in the 1980s. By 1995-1996, the Pentium Pro was the fastest 32-bit chip. At this point, RISC's proponents had to start explaining why a better instruction set did not translate into a faster chip. They never did. Instead, they keep on banging on the "RISC is better" drum without supplying better chips.

One thing to remember is that chips are complex and defy simple binary classification. Even Intel thought that CISC was on the way out, although they were going down a somewhat extreme path with EPIC, but the most successful approach turned out to be a hybrid where complex CISC instructions were broken into RISC-like micro-ops. That got Intel back in the game with the Pentium Pro getting close enough to the DEC Alpha’s performance lead and with the advantage of not having to recompile everything in an era where that was orders of magnitude harder than it is now. I wouldn’t say either side won since that has been going back and forth for decades now.
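To make the "broken into micro-ops" idea concrete, here is a toy sketch in Python. It is only an illustration: real micro-op formats are undocumented and specific to each chip, and the `MicroOp` type, the `decode` function, and the `tmp_*` register names are all made up for this example. It shows how one read-modify-write instruction could be split into a load, an ALU operation, and a store, while a register-to-register instruction stays a single operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    kind: str     # "load", "alu:<op>", or "store" (hypothetical labels)
    dst: str      # destination register or memory operand
    srcs: tuple   # source operands


def decode(instr):
    """Toy decoder for 'OP DST, SRC'; an operand in [brackets] is memory."""
    op, rest = instr.split(maxsplit=1)
    dst, src = [s.strip() for s in rest.split(",")]
    uops = []
    if src.startswith("["):
        # Memory source: fetch it into a temporary first.
        uops.append(MicroOp("load", "tmp_src", (src,)))
        src = "tmp_src"
    if dst.startswith("["):
        # Memory destination: load, operate, then write back.
        uops.append(MicroOp("load", "tmp_dst", (dst,)))
        uops.append(MicroOp("alu:" + op, "tmp_dst", ("tmp_dst", src)))
        uops.append(MicroOp("store", dst, ("tmp_dst",)))
    else:
        uops.append(MicroOp("alu:" + op, dst, (dst, src)))
    return uops


for uop in decode("add [rbx], rax"):
    print(uop)
```

Calling `decode("add rax, rbx")` returns a single ALU micro-op, while `decode("add [rbx], rax")` returns three, which is the kind of split the hybrid approach relies on.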

It’s also hard to separate that from other factors: was the Pentium more successful than the PowerPC because of CISC or because Intel had much better fabs than Motorola? If Motorola, IBM, DEC, or HP had had less incompetent management at the time it’s possible that we might remember this period very differently.

>Even Intel thought that CISC was on the way out, although they were going down a somewhat extreme path with EPIC, but the most successful approach turned out to be a hybrid where complex CISC instructions were broken into RISC-like micro-ops.

Note that neither are these micro-ops RISC (they are long, complex and specific to the chip, actually much closer to EPIC), nor was this micro-ops approach new.

Intel tried to use the 64 bit transition to finally abandon x86.

It almost managed to do this, but AMD saw a chance and AMD64 happened, ironically leveraging the x86 software moat against Intel. Software would not migrate to Itanium, but smoothly transition to AMD64.

Without the moat, Itanium was doomed. But it was doomed either way, as was found later, due to its complexity: it complicated compilers, having learned nothing from the RISC paper.

> Note that neither are these micro-ops RISC (they are long, complex and specific to the chip, actually much closer to EPIC), nor was this micro-ops approach new.

I was mostly thinking about them as simpler - which is not the same as simple - but mostly in the larger context of it not being a religion where chip designers pick a side and never budge, when in reality everyone finds ways to use good ideas that make sense for their designs.

>mostly in the larger context of it not being a religion where chip designers pick a side and never budge, when in reality everyone finds ways to use good ideas that make sense for their designs.

Chip designers design chips (I suspect you meant to emphasize microarchitectures), whereas ISA designers design ISAs.

An ISA designer's concern is to design a good ISA[0]. A good ISA will be chosen and loved by microarchitecture designers. This enthusiasm might then come off as religion-like.

0. https://news.ycombinator.com/item?id=39848530

>why did almost all of the RISC chips from the 1980s and 1990s fail?

Citation needed.

>Why aren't we using them today?

Because we are using better chips made recently, not the ones from the 80s and 90s.

>Why are Amazon, Google, and Microsoft buying an enormous number of x64 chips?

Because of performance vs. cost in the current market, as well as access to the x86 software moat.

But this is changing. Notably, Amazon has Graviton, Microsoft has Windows for ARM with grease for x86 software, and Google has a digital design team, which is already iterating on RISC-V based accelerators.

Facebook, the FAANG member you have not mentioned, has its own RISC-V server effort.

>I don't think instruction set really matters that much

And yet, you're writing this very opinionated comment about ISAs.

>RISC was a great idea

Yes, it was. This is why the industry never again made a tabula rasa CISC ISA.

>At this point, RISC's proponents had to start explaining why a better instruction set did not translate into a faster chip.

The RISC chips actually were faster. But this did not matter, as Intel had the better fabs, and the cash.

So Intel was able to brute-force its way into enough performance, cheaply enough, that the market would not bother going through the pain of switching ISAs.

>without supplying better chips.

The chips were better despite Intel's fab advantage. But they were not cheaper, nor did they run the software the market wanted to run.

They sure sold these Pentiums, and were able to buy (and kill) Alpha later.

The one and only reason x86 survives to date is this software moat.

This moat advantage is in danger now, thanks to Microsoft's efforts to detach Windows from x86 and provide emulation to handle the transition like Apple did.

You asked for a citation about why almost all RISC chips from the 80s and 90s failed. Well, here it is:

- ARM (ARM) - ARM has done very well. It traditionally focused on low power chips. Over the past two decades, ARM has created faster and faster chips. Its chips can now be used in laptop, desktop, and server systems. I would argue ARM is a success. I would also argue it's not clear that ARM systems are faster than Intel or AMD systems.

- Alpha (Digital Equipment / Compaq / Hewlett Packard) - Alpha is dead. No one makes or sells Alpha computers any more.

- MIPS / Silicon Graphics - Silicon Graphics is gone. MIPS may survive as an embedded chip. MIPS chips are not used in mainstream servers, nor do they outperform x86 chips from AMD or Intel.

- PA RISC (Hewlett Packard) - HP dropped this in favor of Intel's Itanium chips.

- POWER (IBM) - This architecture is still being sold by IBM in real products. My guess is it is a good chip but it is still more expensive and slower than chips from AMD and Intel. Still, IBM deserves a lot of credit for keeping POWER going for over 30 years and for outlasting all of the other server/workstation RISC chips from the 1980s and early 1990s (ARM was not in servers or workstations during this time period). One sign that POWER is slower than Intel and AMD is there are no POWER chip benchmark results for the integer speed tests on the SPEC web site (https://www.spec.org/cpu2017/results/cint2017.html). This implies that IBM's cores are slower than x86 cores.

- PowerPC (Motorola and IBM) - This may survive as an embedded chip. It is not used in servers or desktop/laptop computers. Apple was the only user and they switched to Intel in the 2000s (they also recently switched from Intel to their own custom ARM Mx chips).

- SPARC (Sun Microsystems) - The microprocessor design team radically downsized sometime in the last decade (https://www.networkworld.com/article/964265/the-sun-sets-on-...). It still has support but my guess is the chips are not fast or cheap. The only reason you would use it is if you had software you could not move to x86 or ARM systems.

We had 7 RISC chip architectures launch in the 1980s and 1990s. 6 were used in workstations and servers. One (ARM) was primarily used in low power systems in the 1990s. In 2024, 2 are still going (ARM and POWER), 1 is on life support (SPARC), 2 may be used in some embedded systems (MIPS and PowerPC), and 2 are dead (Alpha and PA RISC).
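For anyone who wants to check the tally, a few lines of Python reproduce it. The status labels are just this comment's own 2024 assessment from the list above, not independent data, and the `STATUS` and `tally` names are invented for the sketch.

```python
from collections import Counter

# Status of each architecture as assessed in the list above (an opinion, not data).
STATUS = {
    "ARM": "still going",
    "POWER": "still going",
    "SPARC": "life support",
    "MIPS": "embedded only",
    "PowerPC": "embedded only",
    "Alpha": "dead",
    "PA-RISC": "dead",
}


def tally(status):
    """Count how many architectures fall under each status label."""
    return Counter(status.values())


counts = tally(STATUS)
for label, n in counts.items():
    print(f"{label}: {n}")
print("total:", sum(counts.values()))
```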

ARM chips may be able to outperform x86 chips but I have seen very little evidence of this (Apple's M1, M2, and M3 may be better but it is hard to tell without reliable independent benchmarks). POWER is probably a good chip but it has failed to beat x86's performance. The rest are irrelevant in the laptop, desktop, and server world.

My main point is that RISC's proponents promised that RISC chips would be faster than CISC chips like the x86/x64 chips. In 2024, we have not seen any evidence that the fastest chip must be a RISC chip. In fact, we have seen the opposite: none of the RISC architectures holds the undisputed performance crown. If RISC was better, at least one of the RISC chip designs would have defeated AMD/Intel by now.

>ARM success

Yes.

>Alpha (Digital Equipment / Compaq / Hewlett Packard) - Alpha is dead. No one makes or sells Alpha computers any more.

Important to mention: Intel bought the Alpha IP, thus ensuring Alpha is no longer a problem to them.

>MIPS chips are not used in mainstream servers, nor do they outperform x86 chips from AMD or Intel.

Important to mention: MIPS abandoned MIPS ISA in favor of RISC-V, which doesn't hold the workstation/server performance crown... yet. Anytime now.

>POWER (IBM) - This architecture is still being sold by IBM in real products. My guess is it is a good chip but it is still more expensive and slower than chips from AMD and Intel.

Emphasis on expensive. They would be competitive, except price means they are not.

>SPARC

Register Window was a bad idea after all. Also, Oracle. Might as well be dead.

>We had 7 RISC chip architectures launch in the 1980s and 1990s.

In the list, yes. There are more. But it is important to observe that's because the industry stopped launching CISC ones.

And from the CISC ones, only x86 remains.

>My main point is RISC's proponents promised that RISC chips would be faster than CISC chips

There is no such promise in the RISC whitepaper. Citation needed as to where this promise is found (and who these RISC proponents are).

>If RISC was better, at least one of the RISC chip designs would have defeated AMD/Intel by now.

No, it does not follow. Furthermore, AMD and Intel can and have made RISC chips before, and will likely make RISC chips again.

A far better question is "has RISC succeeded?", and here the answer is yes. For decades, there have been no tabula rasa CISC ISAs, and among the remaining ones only x86 has some life to it.

Meanwhile, RISC ISAs drive smartphones, the computers that most people use the most, as well as Mac computers, the most prominent alternative to Windows computers in the market.