| > I am under the impression that the design was heavily data driven, building on decades of industry experience in many different sectors and actually providing efficient instructions for the operations that are used the most in real code. Yeah, that's the impression I get too. I also get the impression they were planning ahead for the very wide GBOoO designs (I think Apple had quite a bit of influence on the design and they were already working a very wide GBOoO microarch), so there is a bias towards a very dense fixed-width encoding, at the expense of increased decoding complexity. ARM weren't even targeting the ultra low end, as they have a completely different -M ISA for that. This is in contrast to RISC-V. Not only do they target the entire range from ultra low end to high performance, but the resulting ISA feels like it has a biased towards ultra low gate count designs (the way immediates are encoded are points towards this). --------------- You might hate me for this, but I have to raise the question: Does AArch64 actually count as a RISC ISA? It might have the fixed width encoding and load/store arch we typically associate with RISC ISAs. But there is one major difference that arguably disqualifies it on a technicality. All the historic RISC ISAs were designed in parallel with the first generation microarchitecture of the first CPU and where hyper-optimised for that microarchitecture (often to a fault, leaving limited room for expansion and introducing "mistakes" like branch delay slots). Such ISAs were usually very simple to decode, which lead to the famous problems that RISC had with code density. I lean towards the opinion that this tight coupling between ISA and RISC microarchitecture is another fundamental aspect of a RISC ISA. But AAarch64 was apparently designed by committee, independent of any single microarchitecture. And they apparently had a strong focus on code density. The result is something that is notably different from any other RISC ISA. You could make a similar argument about RISC-V, it was also designed by committee, independent of any single microarchitecture. But they also did so with an explicit intention make a RISC ISA, and the end result feels very RISC to me. > that perhaps even uses something like software-aided decoding (which appears to be more power-efficient than pure hardware decoding) At this point, I have very little hope for software-aided decoding. Transmeta tried it, Intel kind of tried it with software x86 emulation on Itanium, Nvidia bought the Transmeta patents and tried it again with Denver. None of these attempts worked well, so I kind of have to conclude it's a flawed idea. Though the flaw was probably the statically scheduled VLIW arch they were translating into. Maybe if you limited your software decoding to just taking off the rough edges of x86 instruction encoding it could be a net win. |
That's the brilliance of it all, IMO. They didn't have to target the ultra low end, since they already had an ISA that works perfectly in that segment (the -M is basically a dumbed down ARMv7, but most of the ecosystem came for free).
Unlike...
> This is in contrast to RISC-V.
...which is a mistake IMO. When you want to be good at everything, you're not the best at anything.
Footnote: I am convinced that we'll see a fork (of sorts) of RISC-V for the GBOoO segment. Since the ISA is free, companies will do what they see fit with it. Possible candidates for this are NVIDIA and Apple (both known for wanting full control of their HW and SW), or possibly some Chinese company that needs a strong platform now that the US is putting in all kinds of trade restrictions (incl. RISC-V https://techwireasia.com/10/2023/do-risc-v-restrictions-sign...).
> And they apparently had a strong focus on code density.
This is also very interesting. At first you'd think that they abandoned code density when they dropped thumb, went for fixed width 32-bit instructions, and even dropped ldm/stm, but when you look closer at the ISA you realize that AArch64 code is very dense. From what I've seen it's usually denser than x86 code. I attribute a large part of this to many key instructions that appear to have been designed for cracking (e.g. ldp x29, x30, [sp], 48 - looks suspiciously crackable, it's really 2-3 instructions in one, and it has three (!) outputs).
So...
> You might hate me for this, but I have to raise the question:
Not at all :-)
> Does AArch64 actually count as a RISC ISA?
No. It has a few obvious traits of a RISC (it's clearly a load-store ISA with many registers), but it also has instructions that clearly were not designed with a 1:1 mapping from architecture to microarchitecture (which I would expect in a RISC ISA).
> I lean towards the opinion that this tight coupling between ISA and RISC microarchitecture is another fundamental aspect of a RISC ISA.
Yes, you have a point. When instructions are designed to decode into a single "internal instruction" that occupies a single slot in the pipeline (one instruction = one clock), we have a RISC ISA. That does not say anything about pipeline topology, though (e.g. parallelism, pipeline lengths, branch delays, etc).
There are obviously edge cases and exceptions that make a precise definition tricky. For instance, I consider MRISC32 to be a RISC ISA, but an implementation may expand vector instructions into multiple operations that take several clock cycles to complete (not entirely different from ldm/stm in ARMv7).
I think that we need to accept a sliding scale, instead of requiring hard limits. Hence the term "RISC-style", rather than "RISC". Perhaps some kind of a Venn diagram would make better sense :-)
> None of these attempts worked well ... the flaw was probably the statically scheduled VLIW
I still think that there is potential in software-aided decoding/translation (heck, most of the software that we run on a daily basis is JIT-translated, so it can't be that bad).
I think that Transmeta failed largely because they initially aimed at a different segment (high performance) but had to adjust their targets to a segment that wasn't really mature at the time (low power), and they didn't have a backup for catering to high performance needs.
However... If software decoding is only used for small power efficient cores (and maybe they use something else than VLIW?) that live together with proper high end cores on a SoC, then I think that the situation would be completely different.
IIRC a main driver for Transmeta was to circumvent x86 licensing issues, so even if big/little would be feasible at the time and they had the skills and manpower to pull off a high end x86 core, that would probably not be an option.