| HN Mirror


by avianes 1374 days ago

> Um... wat? No CPU tries to decode 99 bytes of memory in a cycle

Actually, no x86 processor decodes 8 instructions in parallel. This is an example to illustrate how the number of possible offsets scales with 15 instruction lengths.
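To make the scaling concrete, here is a rough sketch (not a model of any real decoder) that counts the byte offsets at which the Nth instruction of a fetch group could begin. It assumes x86 lengths of 1 to 15 bytes and, for contrast, a fixed 4-byte RISC encoding; the function names are made up for illustration.

```python
MIN_LEN = 1   # shortest x86 instruction, in bytes
MAX_LEN = 15  # architectural maximum x86 instruction length, in bytes


def possible_start_offsets(slot):
    """Byte offsets where the slot-th instruction (0-based) could begin.

    Every earlier instruction is between MIN_LEN and MAX_LEN bytes long,
    so any offset from slot*MIN_LEN to slot*MAX_LEN is reachable.
    """
    return range(slot * MIN_LEN, slot * MAX_LEN + 1)


def fixed_width_start_offset(slot, width=4):
    """With a fixed-width encoding there is exactly one candidate offset."""
    return slot * width


if __name__ == "__main__":
    for slot in range(8):
        candidates = len(possible_start_offsets(slot))
        print(f"instruction {slot}: {candidates} possible x86 offsets, "
              f"fixed-width offset {fixed_width_start_offset(slot)}")
```

The number of candidates grows linearly with the slot index for the variable-length case, while the fixed-width case always has a single known offset.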

> So you decode 32 instructions starting at each byte you've fetched

No you don't do that, it's too power consuming.

> But the combinatorics you're citing seem ridiculous, I don't understand that at all.

What I'm trying to explain is that decoding 8 instructions in parallel on x86 is hardly possible, while decoding 8 instructions (or more) per cycle from a RISC architecture is never a problem.

1 comments

ajross 1374 days ago

> No you don't do that, it's too power consuming.

Uh... yes you do? How else do you think it works? I'm not saying there's no opportunity for optimization (e.g. you only do this for main memory fetches and not uOp execution, pipeline it such that the full decode only happens a stage after length decisions, etc...), I'm saying that it isn't remotely an intractable power problem. Just draw it out: check the gates required for a 64->128 Dadda multiplier or 256 bit SIMD operation and compare with what you'd need here. It's noise.

And your citation of "8 instructions in parallel" seems suspicious. Did I just get trolled into an Apple vs. x86 flame war?


avianes 1374 days ago

> Uh... yes you do? How else do you think it works?

No, I literally explain it in my first answer. The part about "1590 decoders" is irrelevant, since I misunderstood your message (I thought you were talking about using 16 decoders to decode the 16 instruction lengths of a single instruction).

But the rest on instruction length decode is how you actually do it.

> I'm saying that it isn't remotely an intractable power problem.

I mean, obviously, if you ignore all the power consumption issues of running 32 decoders in parallel and using only 5 of the 32 results, then yes, there's no problem.

But in reality, yes it's a problem to decode many x86 instructions in parallel.
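A toy illustration of the brute-force scheme being argued about: decode at every byte offset of a 32-byte window, then keep only the results that land on real instruction boundaries. The encoding here is purely hypothetical (the low nibble of the first byte gives the length); real x86 length determination is far more involved.

```python
def toy_length(first_byte):
    """Hypothetical toy encoding: the low nibble is the length, at least 1."""
    return max(1, first_byte & 0x0F)


def speculative_decode(window):
    """Decode at every offset, then select the offsets that are real starts.

    Returns (candidates, starts): one candidate length per byte offset,
    and the offsets that actually begin an instruction.
    """
    # One "decoder" per byte: each offset is treated as a potential start.
    candidates = [toy_length(b) for b in window]

    # Chain the lengths from offset 0 to find the true boundaries.
    starts = []
    pos = 0
    while pos < len(window):
        starts.append(pos)
        pos += candidates[pos]
    return candidates, starts


if __name__ == "__main__":
    window = bytes([0x07] * 32)  # every toy instruction is 7 bytes long
    candidates, starts = speculative_decode(window)
    wasted = len(candidates) - len(starts)
    print(f"decoded {len(candidates)} offsets, used {len(starts)}, "
          f"discarded {wasted}")
```

With 7-byte toy instructions, 32 decodes produce only 5 useful results. How much that waste costs in real silicon is exactly the point of disagreement here; the sketch only shows where the discarded work comes from.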

> Just draw it out: check the gates required for a 64->128 Dadda multiplier or 256 bit SIMD operation and compare with what you'd need here. It's noise.

Yes, the energy consumption of multipliers is high, but I don't see how that is an argument for building an inefficient decoder. Also, a multiplier's power consumption depends on transistor switching activity, and you can expect the MSBs of the operands not to change much. For a decoder, the switching activity will be high.

> And your citation of "8 instructions in parallel" seems suspicious. Did I just get trolled into an Apple vs. x86 flame war?

Neither a troll nor a flame war. I don't use Apple products, mainly because I don't agree with Apple's practices. But choosing a RISC ISA does allow them to decode a lot of instructions in parallel for little energy and complexity.

I chose 8 because it is the maximum that the mainstream will currently see. You might argue that 8 RISC instructions are not comparable with 8 CISC instructions, but even with say 4 CISC instructions it will still consume more energy.


ajross 1374 days ago

> You might argue that 8 RISC instructions are not comparable with 8 CISC instructions, but even with say 4 CISC instructions it will still consume more energy

Alder Lake decodes six. And again, your intuition about power costs here is just simply wrong. Instruction decode is Simply Not a major part of the power budget of a modern x86 CPU. It's not.


avianes 1374 days ago

> And again, your intuition about power costs here is just simply wrong. Instruction decode is Simply Not a major part of the power budget of a modern x86 CPU. It's not.

I never said that instruction decode was a major part of the power budget.

And it is not a major part precisely because they don't decode 32 instructions in parallel. That's the purpose of having an instruction length decoder ahead of instruction decode.
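A minimal software sketch of that two-stage idea, under the same hypothetical toy encoding (low nibble of the first byte is the length): a cheap length pass marks the boundaries, and the expensive decoder runs only at the marked offsets. The sequential loop is a simplification; hardware would find the boundaries with parallel logic.

```python
def toy_length(first_byte):
    """Hypothetical toy encoding: the low nibble is the length, at least 1."""
    return max(1, first_byte & 0x0F)


def mark_boundaries(window):
    """Length-decode stage: find instruction start offsets, nothing else."""
    starts = []
    pos = 0
    while pos < len(window):
        starts.append(pos)
        pos += toy_length(window[pos])
    return starts


def full_decode(window, start):
    """Stand-in for the expensive full decoder: returns (offset, raw bytes)."""
    length = toy_length(window[start])
    return start, window[start:start + length]


def decode_window(window, width=4):
    """Run the full decoder only on the first `width` marked boundaries."""
    starts = mark_boundaries(window)[:width]
    return [full_decode(window, s) for s in starts]


if __name__ == "__main__":
    window = bytes([0x03, 0xAA, 0xBB, 0x01, 0x02, 0xCC])
    for offset, raw in decode_window(window):
        print(offset, raw.hex())
```

Here the number of full decodes equals the decode width, not the number of bytes fetched.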
