by kjhghjmkedfcv 5928 days ago
The RISC/CISC thing is a little simplified. One reason RISC has never caught on on the desktop is that memory speed hasn't kept up with CPU speed (and can't, given the laws of physics). So if a RISC CPU takes 10 instructions to do what a CISC can do in 1, it loses any speed advantage if it takes 10x as long to get the next instruction from memory.

The principal reason people use ARM is low power. Part of its low power comes from the RISC design, but it's not as simple as that. To reach the same overall performance as an x86 the RISC may have to use more power, simply because power increases faster than clock frequency.
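
To put rough numbers on "power increases faster than clock frequency", here is a minimal sketch. It assumes the textbook dynamic-power model P = C * V^2 * f, plus the simplifying assumption that supply voltage has to rise roughly in step with frequency. The function name and the ratios are made up for illustration, and leakage and voltage floors are ignored.

```python
def relative_dynamic_power(freq_ratio, voltage_tracks_frequency=True):
    """Dynamic power relative to a baseline, under P = C * V^2 * f.

    freq_ratio: new clock divided by baseline clock.
    voltage_tracks_frequency: assumption that V scales linearly with f.
    """
    voltage_ratio = freq_ratio if voltage_tracks_frequency else 1.0
    return voltage_ratio ** 2 * freq_ratio


for ratio in (1.0, 1.5, 2.0):
    print(f"{ratio:.1f}x clock -> {relative_dynamic_power(ratio):.2f}x power")
```

Under these assumptions doubling the clock costs about 8x the dynamic power, while at a fixed voltage it would cost only 2x.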

5 comments

The difference in RISC/CISC instruction count is closer to 2:1 than 10:1. (Unless you are using a VAX polynomial evaluation opcode, but that is an extreme.)

ARM ameliorates this by having multiple instruction sets. The Thumb instructions are a denser encoding, if somewhat slower. 90/10 rules apply.

Thumb instructions are the same speed, but can only perform a subset of what the ARM instruction set can; each instruction takes half the space. Thumb can be faster if it eliminates cache overflows, but can also be a lot slower if faster ARM instructions have to be emulated with Thumb equivalents.
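
A back-of-the-envelope sketch of the cache side of that trade-off. The instruction widths (32-bit ARM, 16-bit Thumb) are real; the loop size, the cache size and the 1.3x Thumb expansion factor are hypothetical numbers picked only to show the effect.

```python
ARM_BYTES = 4    # classic ARM instructions are 32 bits wide
THUMB_BYTES = 2  # original Thumb instructions are 16 bits wide


def code_size_bytes(arm_instruction_count, thumb=False, thumb_expansion=1.0):
    """Size of a routine in bytes.

    thumb_expansion: hypothetical factor for how many more Thumb
    instructions are needed to do the same work as the ARM version.
    """
    if thumb:
        return round(arm_instruction_count * thumb_expansion) * THUMB_BYTES
    return arm_instruction_count * ARM_BYTES


def fits_in_cache(size_bytes, cache_bytes):
    """True if the whole routine can sit in the instruction cache."""
    return size_bytes <= cache_bytes


ICACHE = 16 * 1024  # assumed 16 KiB instruction cache
arm_size = code_size_bytes(5000)
thumb_size = code_size_bytes(5000, thumb=True, thumb_expansion=1.3)
print(arm_size, fits_in_cache(arm_size, ICACHE))
print(thumb_size, fits_in_cache(thumb_size, ICACHE))
```

With these made-up numbers the ARM version of the loop spills out of the cache and the Thumb version fits, even though it executes 30% more instructions.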
> it loses any speed advantage if it takes 10x as long to get the next instruction from memory.

I'm just thinking out loud... but what if instructions in memory were simply compressed, and the CU's decode step were a decompression algorithm, rather than lots of opcode-specific lookups? It would still be a RISC processor, basically, just with a decompression coprocessor.

Compression wouldn't be much use if it were applied one opcode at a time, so I suppose you'd have to either read the code one block at a time, which could make jumps very slow, or the compiler and instruction decoder would have to do somewhat crazy stuff to turn code paths into compressed blocks.
Main memory is already read a block at a time anyway, to get the gains we all expect from spatial locality. I'm imagining the blocks (probably equivalent to memory pages, in practice) would be kept uncompressed in L1/2 cache memory, with an additional layer of cache added on top for compressed blocks. Then, a near jump would be a read on a low-cache hit, and a decode on a high-cache hit, while a long jump would be a page-fault+decode as usual.
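
A toy model of that lookup order, just to make the three paths concrete. Everything here is an assumption for illustration: the class and method names are invented, zlib stands in for whatever a hardware decoder would really use, the block size is an assumed 4 KiB, and there is no real eviction policy.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, roughly a memory page


class CompressedCodeStore:
    """Toy model: main memory holds zlib-compressed code blocks."""

    def __init__(self, code: bytes):
        self.memory = {
            n: zlib.compress(code[off:off + BLOCK_SIZE])
            for n, off in enumerate(range(0, len(code), BLOCK_SIZE))
        }
        self.compressed_cache = {}  # extra cache layer for compressed blocks
        self.decoded_cache = {}     # stands in for the uncompressed L1/L2

    def evict_decoded(self, block: int):
        """Drop a block from the uncompressed cache only."""
        self.decoded_cache.pop(block, None)

    def fetch(self, address: int):
        """Return (byte at address, which path served the fetch)."""
        block, offset = divmod(address, BLOCK_SIZE)
        if block in self.decoded_cache:
            path = "read"
        else:
            if block in self.compressed_cache:
                path = "decode"
            else:
                path = "fault+decode"
                self.compressed_cache[block] = self.memory[block]
            self.decoded_cache[block] = zlib.decompress(
                self.compressed_cache[block]
            )
        return self.decoded_cache[block][offset], path


store = CompressedCodeStore(bytes(range(256)) * 64)  # four 4 KiB blocks
print(store.fetch(10))    # first touch of block 0
print(store.fetch(20))    # near jump, block already decoded
store.evict_decoded(0)
print(store.fetch(30))    # still in the compressed cache
print(store.fetch(5000))  # long jump into block 1
```

The four fetches take the "fault+decode", "read", "decode" and "fault+decode" paths in that order.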
Kind of, but not exactly.

The amount you'd need to increase the clock of a RISC CPU to get similar performance to desktop CISC CPUs is a lot less than the 2.4GHz you're currently using.

The shorter pipelines and simpler cycles mean that you don't need to make the CPU crazy-fast to boost performance as much as you would with CISC. Intel has improved that starting with the Core CPUs, but it's still not as good as a RISC design. The really deep pipeline in the P4 series was a killer: the cores were churning away at 3.6GHz and still not getting much work done. The "density of work" in a RISC cycle is (was) much higher than in CISC, and that really does help keep the power consumption down to a minimum.

Hence the advantages of a CISC frontend and a RISC backend, as x86 has evolved to. An x86 might pull one instruction from memory, translate it into 10 backend operations, and get the best of both worlds.
… at the expense of extra computation and power consumption. And I think the current CISC backends would be better called VLIW.
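
As a rough picture of the frontend-to-backend translation being discussed, here is a toy "cracker" that turns one read-modify-write instruction into load/op/store micro-ops. The tuple format, the mnemonics and the `tmp0` register are invented for illustration; this is not how any real x86 decoder is specified.

```python
def is_memory_operand(operand):
    """Treat anything written like [reg] as a memory operand."""
    return operand.startswith("[") and operand.endswith("]")


def crack(instruction):
    """Split one CISC-style (op, dst, src) instruction into micro-ops."""
    op, dst, src = instruction
    if is_memory_operand(dst):
        # Read-modify-write: load, operate on a temporary, store back.
        return [
            ("load", "tmp0", dst),
            (op, "tmp0", src),
            ("store", dst, "tmp0"),
        ]
    # Register-only instructions map to a single micro-op.
    return [(op, dst, src)]


for uop in crack(("add", "[rbx]", "rax")):
    print(uop)
```

One fetched instruction becomes three backend operations here; the extra decode work is the cost the reply above is pointing at.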

There's more than one way to skin a cat, and people can talk all day about what to call each one.

Absolutely. It's a real shame AMD beat Intel to the x64 punch, or we'd all be running Itanium today. VLIW nevermore....
No, the problem with Itanium is/was that no one knew how to write good compilers for it. Also, VLIW is incredibly close to the hardware and thus bound to be outdated very quickly. It didn't help that IA64 has about 40 bits per instruction and thus the fewest instructions per byte of all the mainstream architectures.
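
For the density point, a quick calculation. IA-64 packs three 41-bit instruction slots plus a 5-bit template into a 128-bit bundle; the fixed 32-bit and 16-bit rows are added only as generic points of comparison, and the helper name is made up.

```python
IA64_BUNDLE_BITS = 128  # 3 slots * 41 bits + 5-bit template
IA64_SLOTS = 3


def instructions_per_byte(bits_per_instruction):
    """How many instructions fit in one byte of code, on average."""
    return 8 / bits_per_instruction


encodings = {
    "IA-64 bundle": IA64_BUNDLE_BITS / IA64_SLOTS,
    "fixed 32-bit": 32,
    "fixed 16-bit": 16,
}
for name, bits in encodings.items():
    print(f"{name}: {bits:.1f} bits/instr, "
          f"{instructions_per_byte(bits):.3f} instr/byte")
```

Counting the template bits, an IA-64 instruction costs a little under 43 bits of fetch bandwidth.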

Transmeta had the right idea: they did the x86-to-VLIW translation dynamically at runtime and in software, coupled with proper code caches. But it seems they were too early -- the market wasn't ready for them.

> So if a RISC cpu takes 10 instructions to do what a CISC can do in 1, it loses any speed advantage if it takes 10x as long to get the next instruction from memory.

99% of the time instructions come from the instruction cache, not from memory. And as jws said, it's not 10:1.
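
To see why the hit rate matters so much, here is a minimal sketch using the usual average-access-time formula (hit time plus miss rate times miss penalty). The 99% hit rate comes from the comment above; the 1-cycle hit and 100-cycle miss penalty are assumed round numbers, not measurements.

```python
def average_fetch_cycles(hit_rate, hit_cycles, miss_penalty_cycles):
    """Average cycles per instruction fetch with an instruction cache."""
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles


# Assumed numbers: 1-cycle cache hit, 100-cycle trip to main memory.
with_cache = average_fetch_cycles(0.99, 1, 100)
no_cache = average_fetch_cycles(0.0, 1, 100)
print(f"with cache:    {with_cache:.2f} cycles per fetch")
print(f"without cache: {no_cache:.2f} cycles per fetch")
```

With these numbers the average fetch costs about 2 cycles instead of about 101, so even a 2:1 instruction-count penalty is nowhere near the memory-bound picture in the quoted comment.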

Yes, and in practice ARM isn't really RISC and x86 isn't really CISC; with VLIW and pipelines and caches it's more complex.

But the original RISC research was in a time when neither CPU clocks nor memory bandwidth was anywhere near physical limits.

It's not the "RISC (Apple) is clever and CISC (Intel) is a dumb dinosaur" message the article is aiming at.