Hacker News
by idividebyzero 2043 days ago
I think it depends somewhat on the definition you give it. If you require RISC to be a load/store architecture, x86 is not even close to being one. Also, aarch64 is a variable-length instruction set and includes complex instructions (such as those that perform AES operations). Compiler optimizations are meant to benefit all architectures, regardless of RISC/CISC.
4 comments

Personally, I think the RISC/CISC "question" isn't really meaningful anymore, and it's not the right lens with which to compare modern architectures. Partially, this is because the modern prototypes of RISC and CISC--ARM/AArch64 and x86-64, respectively--show a lot more convergent evolution and blurriness than the architectures at the time the terms were first coined.

Instead, the real question is microarchitectural. First, what are the actual capabilities of your ALUs, how are they pipelined, and how many of them are there? Next, how good are you at moving stuff into and out of them--the memory subsystem, branch prediction, reorder buffers, register renaming, etc.? The ISA only matters insofar as it controls how well you can dispatch into your microarchitecture.

It's important to note how many of the RISC ideas haven't caught on. The actual "small" part of the instruction set, for example, is discarded by modern architectures (bring on the MUL and DIV instructions!). Designing your ISA to let you avoid pipeline complexity (e.g., branch delay slots) also fell out of favor. The general notion of "let's push hardware complexity to the compiler" tends to fail because it turns out that hardware complexity lets you take advantage of dynamic opportunities that the compiler fundamentally cannot exploit statically.

The RISC/CISC framing of the debate is unhelpful in that it draws people's attention to rather more superficial aspects of processor design instead of the aspects that matter more for performance.

> It's important to note how many of the RISC ideas haven't caught on.

2-in, 1-out didn't, either. Nowadays all floating-point units support 3-in, 1-out via fused multiply-add. SVE provides a mask argument to almost everything.
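
To make the "3-in, 1-out" point concrete, here is a minimal Python sketch of what a fused multiply-add does differently from a separate multiply and add: it takes three inputs and rounds only once. The function names and operand values are my own, chosen for illustration. `fma_reference` emulates the single rounding with exact rationals instead of calling a hardware instruction, and it assumes finite inputs.

```python
from fractions import Fraction

def fma_reference(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: exact a*b+c, rounded once to a double.

    Hypothetical helper for illustration; finite inputs only.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def mul_then_add(a: float, b: float, c: float) -> float:
    """Two separate operations: the product is rounded before the add."""
    return a * b + c

# Operands picked so the exact product has a low-order bit (2**-54) that
# does not fit in a double and is lost when the product is rounded.
a = 1.0 + 2.0**-27
b = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

fused = fma_reference(a, b, c)    # keeps the 2**-54 term
unfused = mul_then_add(a, b, c)   # the term was rounded away
```

With these operands the unfused version returns exactly zero, while the single-rounding version keeps the small residual. That difference is why the operation needs all three inputs at once.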

Unless you're using a definition I'm not familiar with, aarch64 isn't a variable-length instruction set. Here's Richard Grisenthwaite, Arm's lead architect, introducing ARMv8; the slide here confirms "New Fixed Length Instruction Set":

https://youtu.be/GBeEEfmJ3NI?t=570

I understand that they refer to it as a fixed-length instruction set, and that's correct. Note, though, that not all ARMv8 instructions are 4 bytes long. Some instructions that appear together are fused into a single one, and SVE, for instance, introduces a prefix; so in practice, instructions can sometimes be 8 bytes long.
Macro-op fusion of the MOVW/MOVT family doesn't count. At the time of that presentation, SVE didn't exist. Even now, the masked move instruction in SVE can also stand on its own as a single instruction and sometimes it does get emitted as its own uop.
Thanks, yes of course. I guess it's probably fair to say that philosophically it's fixed-length, in the way that the original Arm was RISC, i.e. with some very non-RISC-y instructions. Very different from x86 though.
64-bit Arm is fixed width. Modern 32-bit Arm was not fixed width, as Thumb-2 was widely used.
The main difference is x86 decode is hell to parallelize, as you have no idea where instructions start or end. It's a linear dependency chain of instruction lengths, an antipattern in the modern parallel processing world. Modern x86 CPUs have to use a large number of tricks and silicon to deal with this decently.

With Thumb-2, by contrast, you can at worst just try decoding an instruction at every halfword. At worst you throw away half of the results, when they turn out to be the second half of an instruction that was already handled. If you tried to do the same thing with x86, you'd throw away many more results, since you'd be trying to decode (much more complex encodings) at every byte.
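
A rough Python sketch of the "decode at every halfword" idea, for illustration only. It relies on the Thumb-2 rule that a halfword whose top five bits are 0b11101, 0b11110, or 0b11111 is the first half of a 32-bit instruction. The function names and the sample halfword values are my own assumptions, and real hardware does the per-halfword guesses in parallel rather than in a Python loop.

```python
def is_32bit_first_halfword(hw: int) -> bool:
    """True if this halfword starts a 32-bit Thumb-2 instruction."""
    return (hw >> 11) in (0b11101, 0b11110, 0b11111)

def boundaries_serial(halfwords):
    """Walk the stream one instruction at a time (the dependency chain)."""
    starts, i = [], 0
    while i < len(halfwords):
        starts.append(i)
        i += 2 if is_32bit_first_halfword(halfwords[i]) else 1
    return starts

def boundaries_speculative(halfwords):
    """Guess a length at every halfword, then keep only the real starts."""
    # Step 1: each halfword is examined independently of its neighbours,
    # so hardware could do all of these guesses at the same time.
    lengths = [2 if is_32bit_first_halfword(hw) else 1 for hw in halfwords]
    # Step 2: a cheap selection pass picks the guesses that were real
    # instruction starts; guesses made on second halves are discarded.
    starts, i = [], 0
    while i < len(halfwords):
        starts.append(i)
        i += lengths[i]
    wasted = len(halfwords) - len(starts)
    return starts, wasted

# Sample stream (values chosen for illustration): two 32-bit instructions
# mixed in with three 16-bit ones.
stream = [0x2001, 0xF04F, 0x0102, 0x4408, 0xF8D0, 0x2000, 0x4770]
```

On this stream, two of the seven guesses land on second halves and get thrown away. The wasted work can never exceed half the halfwords, which is the bound described above.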

Is it really so hard to find instruction length in x86? State-machine transitions compose associatively, so you can build a reduction tree to process them in parallel. And the state machine itself isn't too bad: it's mostly prefixes, and figuring out if the opcode uses a ModR/M byte (which most do) or has an immediate operand. And while x86 does have a nasty habit of packing multiple instructions into a single opcode (via specific register values in the ModR/M byte), I believe all of them would share the same behavior in the immediate operand effects.
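
Here is a minimal Python sketch of the associativity argument, using a made-up toy encoding rather than real x86: at an instruction start, the low 2 bits of the byte say how many operand bytes follow. Each byte becomes a small transition table, tables compose associatively, and a prefix scan over them recovers the instruction boundaries in about log2(n) rounds. All names here are hypothetical, and a real x86 length decoder would need far more states.

```python
# Toy encoding, NOT x86. State = operand bytes still to skip;
# state 0 means the next byte is an instruction start.
N_STATES = 4

def transition(byte):
    """Table for one byte: entry s is the next state when in state s."""
    return (byte & 3,) + tuple(s - 1 for s in range(1, N_STATES))

def compose(first, second):
    """Table for 'apply first, then second'. Composition is associative."""
    return tuple(second[first[s]] for s in range(N_STATES))

def starts_serial(code):
    """Reference: the byte-by-byte dependency chain."""
    starts, state = [], 0
    for i, byte in enumerate(code):
        if state == 0:
            starts.append(i)
        state = transition(byte)[state]
    return starts

def starts_scan(code):
    """Same answer via an inclusive prefix scan (Hillis-Steele style)."""
    if not code:
        return []
    prefix = [transition(b) for b in code]
    step = 1
    while step < len(prefix):
        # Every element in a round depends only on the previous round,
        # so hardware could evaluate a whole round in parallel.
        prefix = [
            compose(prefix[i - step], prefix[i]) if i >= step else prefix[i]
            for i in range(len(prefix))
        ]
        step *= 2
    # The state before byte i is the prefix through byte i-1, applied to 0.
    before = [0] + [table[0] for table in prefix[:-1]]
    return [i for i, state in enumerate(before) if state == 0]

# Operand bytes such as 0xAA would look like opcodes if taken out of context.
code = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xFF, 0x03, 0x01, 0x02, 0x03, 0x00])
```

The scan does more total work than the serial walk, but its depth grows logarithmically rather than linearly with the number of bytes. That is the trade a reduction tree makes.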

I suspect that in one pipeline stage, you could at least resolve the entire cacheline into the individual instruction boundaries that can be simultaneously issued into uops, if not having the entire instruction decoded into the hardware fields. You wouldn't know if register 7 referred to a general purpose register, or a debug register, or an xmm reg, or whatnot, but you'd probably know that it was a register 7.

And after you know each instruction boundary, you still have to do a massive mux from positions in the cache line to separate decoders. As I understand it, that's a big part of the problem, and essentially costs more than a single pipeline stage.
x86 is certainly not RISC by any sane definition. It is still one of the least complex historical CISCs.