by lars-b2018 2022 days ago
I think the bigger story here is that ARM64-based computing has been surrounding, and continues to surround, the traditional x86-64 territory. ARM dominates phones, and is popping up in server environments as well, where it is cheaper to run for many workloads than x86-64. We have migrated to ARM-based Graviton processors to deliver our services at my company, and we have been able to reduce our AWS EC2 costs by 40% or so, with almost the same performance. The M1-based Macs certainly show the potential on the desktop. Other vendors will produce better, more competitive desktop power level chips, too, that are completely adequate for desktop computing experiences.

We are seeing the classic Innovator's Dilemma pattern whereby the lesser technology slowly overtakes the more brittle incumbent tech. x86 did the same with minicomputers, and then mainframe type workloads.


ARM instances are usually 10% cheaper than the equivalent x86 instance.

How exactly did you achieve such a dramatic reduction in cost? The Graviton instances aren't faster than the Intel/AMD options, so you need at least the same number of instances unless something else changed.
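For what it's worth, the arithmetic behind that question can be sketched like this. It assumes the simplest possible model, total cost = price per instance × number of instances, and uses only the two figures quoted in this thread (10% cheaper instances, 40% lower bill). The function name is made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, assuming total_cost = price_per_instance * instance_count.
// Given the per-instance discount and the overall saving, returns the ratio
//   new_instance_count / old_instance_count
// that would be needed for the numbers to add up.
double required_instance_ratio(double price_discount, double total_saving) {
    return (1.0 - total_saving) / (1.0 - price_discount);
}
```

With the thread's numbers, `required_instance_ratio(0.10, 0.40)` is 0.6 / 0.9, i.e. about two thirds as many instances as before. Under this model the price difference alone cannot explain a 40% saving, so something else (instance count, instance size, pricing plan) would also have to have changed.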

I do not understand why ARM is considered lesser technology. RISC was always considered to be better than CISC in many academic circles.
Well, two main reasons:

1. Innovator's dilemma: https://en.wikipedia.org/wiki/The_Innovator%27s_Dilemma

ARM in this case is the underdog, attacking the incumbent x86. It is the lesser tech because it started from "below" (lower value niches not taken by the incumbent which prefers higher profit margins).

2. RISC vs CISC has never been settled. Until Apple (and, very recently, Amazon) produced their latest ARM-based architectures, x86 was considered superior. Yes, RISC was theoretically better in the 80s and later 90s, but x86 has been reinvented successfully 2 times (micro-ops and x86_64) and scaled from about 5MHz to 5GHz. It's also a mix of CISC and RISC, it's not pure CISC now. Same for ARM, they've added various instructions which bring it closer to CISC.

As far as I can see the only real difference these days is the constant length of instructions for ARM: https://erik-engheim.medium.com/why-is-apples-m1-chip-so-fas...

Another major difference is the memory model. In X86 other CPUs must always see the writes of a core exactly in the right order. This limits the ability to reorder store ops significantly. ARM requires a memory barrier for this. This is a major reason why X86 emulation is so slow. One must basically issue a memory barrier after every store op.

M1 actually also implements the X86 memory model in HW. It’s only usable for the Rosetta applications and comes with perhaps a 20% perf penalty. But it’s still way better than emulating it with barriers.

In C++ terms it pretty much means X86 is always seq_cst. With ARM one can actually get the benefit of the different memory model options. As an example one can do an atomic access without having to flush the whole store buffer out, which is impossible in X86.
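To make "the different memory model options" concrete, here is a minimal sketch in standard C++. The function name and parameters are made up for illustration. It only shows the source-level choice of ordering; which instructions and barriers a compiler actually emits on x86 or ARM depends on the target and the compiler, and this example does not measure any of that.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: several threads bump one shared counter, using
// whatever memory ordering the caller asks for.
// std::memory_order_relaxed requests only atomicity of each increment;
// std::memory_order_seq_cst additionally requests a single global order.
long count_with_order(int num_threads, int iters_per_thread,
                      std::memory_order order) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, iters_per_thread, order] {
            for (int i = 0; i < iters_per_thread; ++i) {
                counter.fetch_add(1, order);
            }
        });
    }
    // join() synchronizes with each thread's completion, so the final
    // load below observes every increment regardless of `order`.
    for (auto& w : workers) w.join();
    return counter.load();
}
```

Both `count_with_order(4, 10000, std::memory_order_relaxed)` and the `seq_cst` variant return the same total, because the increments are atomic either way. The ordering argument only changes what the hardware is allowed to reorder around them, which is where a weaker model leaves more room.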

Due to the instruction coding and memory model for multicore I don’t really see X86 dominating anymore in the upcoming decades.

And as the modern OoO cores are so similar internally it’s not even a big deal in the end. AMD shouldn’t have any issues with producing a Zen ARM core. Switch the instruction decoder and that’s pretty much it (a ton of design work for sure). Keep the X86 mem model optional for emulation, and binary translation can almost be thought of as just making X86 instructions into fixed width ahead of time.

I am trying to wrap my head around whether ARM's looser memory model is a fundamental performance advantage or not.

I had always assumed that the looser memory model must have a performance benefit. But this comment from last week argues that it doesn't really buy that much, and that a bigger buffer can eliminate most of the difference: https://news.ycombinator.com/item?id=25263461

If TSO forces flushing of store buffers for every atomic access, that seems like a substantial disadvantage for x86.

It has to flush them, because if another core sees the result of the atomic op it must also see everything else that the other core wrote before the op. While it can indeed first see no writes and then suddenly all of them, it can never see just the atomic op and not the previous writes.

Without that the store buffers can be kept unflushed to, as an example, see if one can get a full cacheline or whatnot and only flush then.
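The visibility rule described above (see the atomic op, and you must also see the writes before it) is what C++ calls release/acquire. A minimal sketch, with made-up names, of what has to be requested explicitly in portable code:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo: a producer writes plain data and then publishes a flag;
// a consumer waits for the flag and then reads the data.
// Returns the payload value the consumer observed.
int publish_and_consume() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // the atomic op the other core watches
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                                  // earlier write
        ready.store(true, std::memory_order_release);  // publish it
    });
    std::thread consumer([&] {
        // Spin until the flag is visible; acquire pairs with the release above.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;  // the earlier write is guaranteed visible here
    });

    producer.join();
    consumer.join();
    return observed;
}
```

`publish_and_consume()` returns 42: once the consumer sees `ready == true`, the C++ memory model guarantees it also sees `payload = 42`. If both orderings were `memory_order_relaxed`, the language would no longer give that guarantee (and the read of `payload` would be a data race), even if a particular CPU happened to behave.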

The comment is correct that an X86 with heavy reordering backend will beat arm without one. However arm with one does handily beat X86 with one. Case in point: M1

Interesting, are you suggesting that a large part of the M1 performance advantage is thanks to the weaker ARM memory model?

Is the 20% perf hit of TSO mode that you cite an ARM vs. ARM comparison? If so, that would be pretty damning.

Is there an easy way to flip the M1 into TSO mode for benchmarking? I would love to observe this 20% for myself.

I haven't done exact measurements, but I don't think the cost of enabling TSO is anywhere near as high as 20%. On the contrary, I don't think I have noticed a real difference; perhaps it is but a couple percent slower.
Just want to add one thing: x86 has stronger memory “semantics”. It doesn’t have to work that way behind the scenes; it just has to appear, at the end of the block, as if it worked that way. So x86 does reordering, store combining etc. a lot. IMHO, the performance difference between ARM and x86 is barely related to the ISA. In the M1 case it’s definitely not; there is a lot more going on than just taking advantage of a weaker memory model.
Having to appear to have worked that way does cause restrictions in the multiprocessor case. ARM chips naturally do all of that too, with the memory model simply giving them way more freedom to reorder things.

One couldn’t do an X86 version of M1, mostly because there is no way of making an instruction decoder that wide for it.

And the performance penalty of M1 when working in TSO mode strongly implies that yes the weaker memory model indeed plays a major role. Not the biggest, but definitely not insignificant. Tens of percents here and tens of percents there combined become a ridiculous perf boost.

I think "RISC" was, for a stretch of time, "superior", but it was always in very expensive and very niche workstations and servers.

x86 was itself a kind of underdog in that space, losing out in sheer performance and bandwidth to the high-end RISC market, but making up for it by, you know, being cheap enough that "regular people" could go and buy a family computer with an x86 processor.

Intel managed to slowly improve the various deficiencies over time, and produced a processor not only affordable, but also superior to the incumbent RISC chips in all but niche areas.

I think more than the ISA, Apple's M1 manages to be so good at its job for similar reasons to the old RISC chips -- it's built from the ground-up with a pretty specific target application, rather unlike x86, which has to be all things to all people, with legacy support and scaling from laptops to quad socket 3U servers.

The reinvention that gave x86 the advantage over (workstation) RISC was OoO execution in the Pentium.

It’s a bit silly to think of x86 and ARM as being so different these days. Most x86 code looks a lot like it was produced for a RISC chip and ARM has been gaining some more complex instructions and addressing modes. Academics may have felt that RISC was better than CISC for a long time but I don’t think they were predicting the world of today so much as they were incorrectly predicting the near past. If RISC were so much better then we’d have lots of workstations running on modern Alpha or PA-RISC systems. But we don’t see that.

Ah, yeah, forgot to mention Out of Order Execution.

> Academics may have felt that RISC was better than CISC for a long time but I don’t think they were predicting the world of today so much as they were incorrectly predicting the near past.

Plus you know, academics have been known to be wrong. That's why there's even a saying for it: science advances, one funeral at a time. People are emotional and get attached to their pet theories. There are many examples of this.

It's going to be interesting to see how CPU/GPU tech advances the next few years.

> Most x86 code looks a lot like it was produced for a RISC chip

Well yes, I don't think the x86 BCD instructions are still being used liberally, or AAA, or CMPS/SCAS etc.

It was the PPro, and OoO processors were being developed by lots of companies and teams at that time. Wikipedia indicates it was featured in the PowerPC 601 (93), SPARC64 (95), PPro (95), MIPS R10000 (96), etc. Hard to see where the advantage over workstations was on that point...
But ARM64 instructions are all fixed width still right?

And how specifically are these instructions complex? Do they produce an unusually high number of micro-ops when decoded?

Yeah, the CISC vs RISC thing barely makes sense anymore. It could only matter in the context of hand-programmed and low frequency processors. What remains of "it" (at least in some people's minds) that is still relevant today on major ISAs is clearly instruction encoding, but you could totally make a CISC with fixed length instructions. You could also make a RISC with (highly) variable length but... just why? And actually why make a highly variable length ISA at all, regardless of whether the ISA is RISC or CISC? The real reason that x86 has a highly variable length is only historical. When you only decoded one instruction at a time it did not matter that much. So maybe you wanted some big ones for convenience but it would be a shame to make them all big. So the 8086 had instructions of 1 to 6 bytes (up to 10 with prefix?). And then you had 32-bits with 16-bits compat, it seems it went up to 15 bytes, and then stayed at 15 max for AMD64...
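A rough sketch of why the encoding matters once you want to decode more than one instruction at a time. Big caveat: the "variable length" format below is a toy invented for this example (the low 4 bits of an instruction's first byte give its total length), not real x86 encoding, and both function names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed width: the start of instruction k is simply k * width. Every start
// offset is known without inspecting a single byte, so many decoders can be
// pointed at the stream independently.
std::vector<std::size_t> fixed_width_starts(std::size_t code_size,
                                            std::size_t width) {
    std::vector<std::size_t> starts;
    for (std::size_t off = 0; off + width <= code_size; off += width) {
        starts.push_back(off);
    }
    return starts;
}

// Toy variable-length format (NOT x86): the low 4 bits of the first byte of
// each instruction hold its total length in bytes (1..15). The start of
// instruction k+1 is only known after the length of instruction k has been
// read, so the scan is inherently sequential.
std::vector<std::size_t> variable_length_starts(
        const std::vector<std::uint8_t>& code) {
    std::vector<std::size_t> starts;
    std::size_t off = 0;
    while (off < code.size()) {
        std::size_t len = code[off] & 0x0F;
        if (len == 0 || off + len > code.size()) break;  // malformed/truncated
        starts.push_back(off);
        off += len;
    }
    return starts;
}
```

For a 16-byte buffer of 4-byte instructions the starts are 0, 4, 8, 12 by construction. For the toy stream `{0x01, 0x03, 0xAA, 0xBB, 0x02, 0xCC}` the starts are 0, 1, 4, but offset 4 can only be found after the instruction at offset 1 has been looked at. Real hardware works around this dependency with extra logic rather than a literal loop; the sketch only shows where the dependency comes from.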

Today the most used x86 instructions are not that much different from ARM ones, and in a good number of cases actually even simpler. The simplest way to compare is to look at the assembly an optimizing native compiler emits, and look up what the emitted instructions are doing.

The micro-ops of a processor are actually quite dependent on the ISA. You could do neutral micro-ops, but I doubt this would be very efficient. So you can't really compare the complexity of ISAs by the number of micro-ops issued, because across different archs and microarchs the micro-ops are themselves more complex in some respects and less in others, with tons of similarities with (the core instruction set of) the ISAs they implement.

Complex instructions for backward compat, special, or intrinsically difficult operations are transformed into a potentially very high number of micro-ops, often looked up in a ROM.

What aspect added to x86 is RISC like? It sounds a bit like an oxymoron. If you start with a complex instruction set, you cannot make it less complex by adding new instructions.
Generally, this is (poorly) used to refer to x86 decoding things to micro-ops.
>It's also a mix of CISC and RISC, it's not pure CISC now.

It's not. It's microcoded. Like VAX. Which RISC is a response to.

CISC and RISC are mostly an ISA concept.

Most chips are "microcoded", at least here and there. The term does not define a whole micro-architecture, esp in modern complex ones, but I'm not really sure how you would implement old school CISC chips without microcode concepts. The 8086 was microcoded: https://www.reenigne.org/blog/8086-microcode-disassembled/. It basically means that "you" program some low level internal details of a chip, e.g. the muxes connecting various buses to execution units. Most of the time the "you" is only the chip designers, and it is stored mostly in a ROM. Sometimes microcoding is accessible by actual software programmers, but it is quite rare.

And modern u-ops are not necessarily like what was done at the time of old-school microcode.

Micro ops?
That's two completely different things. Every x86 instruction is decomposed into one or several micro-ops that are executed. Hardly any x86 instructions are microcoded, which would mean that the decoder has to start streaming micro-ops from the microcode ROM.
This is a dynamic evaluation. Right now, ARM is a far superior technology to x86 for my smartphone use cases, and x86 is a massively superior technology for my high-detail PC gaming use cases.

(It's split for things like development, now that the M1 is here. Personally I get my development done with 8-core 4+ GHz x86 chips in Windows/WSL2/Docker, but others are getting development done with Apple Silicon M1 ARM chips!)

Over time the "lesser" technology can become the greater technology for more use cases.

Is that latter case simply due to the availability of high performance GPU cards?

If so, as soon as someone (Apple) makes an ARM chip with sufficient PCIe lanes, then it's only a matter of drivers...

> Is that latter case simply due to the availability of high performance GPU cards?

It's a combination of things:

* Apple has not indicated that they'll allow for third-party GPUs

* Apple has not pushed for widespread gaming compatibility

* Game developers prefer to reach the widest audience

Apple would need the M1 successors to be superior in gaming computation and capable of marrying up to high-performance GPUs, write the necessary drivers, AND get broad enough adoption of Apple gaming for it to factor into game publishing.

This is all theoretically possible, but it didn't happen while Apple was using roughly the same high-end hardware available to PC gamers, so it's questionable (in my mind) that it'll happen when they are locking down their hardware further, and trying to use all their own hardware for gaming.

Wasn't that simply because there was less software available for ARM? And the first ARM chips were quite anemic in performance.

That's not the case anymore, but the stereotype will persist for a while.

The first ARM chips were blazingly fast and made their competitors' performance look anaemic - they have been around since the mid-1980s. The real difference is they went the low-power route while retaining as much performance as possible, rather than depending on desktop power supplies and industrial cooling.

One of the reasons this new chip is so performant is the "headroom" that this approach has given them.

I wish I understood better why Acorn Archimedes did not do better in the market. It looked like they had many great computer models.
Easy, IBM PC clones happened.

Apple are the survivors of the vertically integrated home computers, and they only managed to survive by the reverse acquisition of NeXT, and by diversifying outside of the desktop market.

One of the three major consoles is ARM based and one of the best selling systems of all time. From last gen, the PS Vita was ARM based and considered high end for mobile graphics at the time. Apple has already crushed Intel in on chip graphics and Nvidia is heavily invested in ARM.

I don’t think ARM and graphics will long be known for poor performance.

As a fan of AMD, I hope they see the writing on the wall and are planning accordingly. It would not be surprising to see the successors to the PS5 and Xbox Series running on ARM.

The complaint about ARM as I understood it was that they couldn't match Intel performance. The commentary around Apple seems to be that over the decade-plus where they iterated for iPhone and iPad, it's caught up