Hacker News
by avianes 1337 days ago
The difficulty is not decoding a single instruction; the difficulty is decoding multiple instructions in parallel (let's say 5 to 8 instructions per cycle).

In a modern high-performance processor, instructions are decoded in batches. Decoding the first instruction is straightforward. But x86 instructions range from 1 to 15 bytes, so the second instruction can start at any byte offset from 1 to 15, the third at any byte offset from 2 to 30, and so on. Furthermore, figuring out the length of an x86 instruction requires reading several bytes of that instruction.

In the end, the 8th instruction has 99 possible byte offsets, and assuming that we put, as you suggest, a decoder for each position and length, we need about 1590 decoders and many multiplexers to decode 8 full instructions per cycle.
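The comment does not spell out the arithmetic, so the following is only one reading of where 99 and 1590 come from: a minimal sketch that assumes instruction lengths of 1 to 15 bytes and one decoder per distinct byte position and per length. The function names are made up for illustration.

```python
MIN_LEN = 1   # shortest x86 instruction, in bytes
MAX_LEN = 15  # longest x86 instruction, in bytes


def offset_range(n):
    """Return (lowest, highest) start offset of the n-th instruction (1-based)."""
    return (n - 1) * MIN_LEN, (n - 1) * MAX_LEN


def possible_offsets(n):
    """Number of distinct byte offsets at which the n-th instruction may start."""
    lo, hi = offset_range(n)
    return hi - lo + 1


def naive_decoder_count(n):
    """Assumed scheme: one decoder per byte position 0..highest offset, per length."""
    _, hi = offset_range(n)
    return (hi + 1) * MAX_LEN


for n in range(1, 9):
    print(n, offset_range(n), possible_offsets(n))
print(naive_decoder_count(8))
```

Under these assumptions the 8th instruction starts somewhere between offset 7 and offset 105, which gives 99 offsets, and 106 positions times 15 lengths gives 1590 decoders.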

Of course we don't do that; it would consume a lot of energy for nothing.

To handle that, instruction decoding in a modern x86 processor involves an instruction length decode before the instruction decode itself. The instruction length decode is responsible for identifying instruction positions and boundaries, and it is a challenging part of an x86 processor to design. We don't know exactly how Intel or AMD do instruction length decode, but we know that some published techniques include a length predictor.
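As a rough software analogy of "length decode first, full decode second", here is a minimal sketch. It is not how Intel or AMD do it (that is not public), and real hardware is not a sequential loop. `toy_length` is a hypothetical rule that reads the length from the first byte only, whereas real x86 length decoding has to look at several bytes.

```python
def toy_length(code, pos):
    """Hypothetical length rule (1..15 bytes), NOT real x86 encoding."""
    return code[pos] % 15 + 1


def find_boundaries(code):
    """Length-decode pass: return the start offset of each complete instruction."""
    starts = []
    pos = 0
    while pos < len(code):
        length = toy_length(code, pos)
        if pos + length > len(code):
            break  # instruction spills past the end of this fetch window
        starts.append(pos)
        pos += length
    return starts


def decode_window(code, full_decode):
    """Run the expensive full decoder only at the marked instruction starts."""
    return [full_decode(code, s, toy_length(code, s)) for s in find_boundaries(code)]


window = bytes([2, 0xAA, 0xBB, 0, 1, 0xCC])
print(find_boundaries(window))
print(decode_window(window, lambda code, start, length: code[start:start + length]))
```

The point of the split is that the full decoder runs once per real instruction rather than once per byte of the window.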

That's why, for simplicity and energy efficiency, instruction boundaries must be easily identified and the number of instruction lengths must be kept low.


> In the end, the 8th instruction has 99 possible byte offsets, and assuming that we put, as you suggest, a decoder for each position and length, we need about 1590 decoders and many multiplexers to decode 8 full instructions per cycle.

Um... wat? No CPU tries to decode 99 bytes of memory in a cycle. Alder Lake (ADL) is at 32 currently, I believe. And the instruction starting at byte 12 doesn't change depending on anything but its own data. It either exists (because the previous instruction ended on byte 11) or it doesn't. So you decode 32 instructions starting at each byte you've fetched (the last ones can be smaller subset engines because they don't need to decode longer instruction forms), and then mask them on or off based on earlier instruction state. Then feed your 1-32 decoded instructions through a mux tree to pack them and you're done.

Surely there's more complexity, since this is going to have to be pipelined in practice, and a depth of 32 is going to require something akin to a carry-lookahead adder instead of being chained.
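The decode-everywhere, mask, then pack scheme described here can be sketched in software as below. This is a toy model under stated assumptions: `toy_length` is a hypothetical length rule rather than x86, the window is assumed to begin on an instruction boundary, and the masking step is written as a plain serial chain, which is the part that would need a lookahead-style structure in hardware.

```python
def toy_length(code, pos):
    """Hypothetical length rule (1..15 bytes), NOT real x86 encoding."""
    return code[pos] % 15 + 1


def speculative_decode(code):
    """Return (start, length) for each complete instruction in the window."""
    n = len(code)
    # Step 1: decode at every byte offset as if an instruction started there.
    lengths = [toy_length(code, i) for i in range(n)]
    # Step 2: mask. Offset i is a real start only if a real instruction ended at i - 1.
    valid = [False] * n
    if n:
        valid[0] = True  # assumption: the window starts on an instruction boundary
    for i in range(n):   # serial chain; hardware would not want 32 chained steps
        nxt = i + lengths[i]
        if valid[i] and nxt < n:
            valid[nxt] = True
    # Step 3: pack the surviving, complete instructions into a dense list.
    return [(i, lengths[i]) for i in range(n) if valid[i] and i + lengths[i] <= n]


window = bytes([2, 0xAA, 0xBB, 0, 1, 0xCC])
print(speculative_decode(window))
```

In this example six offsets are decoded and three results survive the mask; how much that unused work costs is what the rest of the thread argues about.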

But the combinatorics you're citing seem ridiculous, I don't understand that at all.

> Um... wat? No CPU tries to decode 99 bytes of memory in a cycle

Actually, no x86 processor decodes 8 instructions in parallel. This is an example to illustrate how the number of possible offsets scales with 15 instruction lengths.

> So you decode 32 instructions starting at each byte you've fetched

No you don't do that, it's too power consuming.

> But the combinatorics you're citing seem ridiculous, I don't understand that at all.

What I'm trying to explain is that decoding 8 instructions in parallel on x86 is hardly possible, while decoding 8 instructions (or more) per cycle on a RISC architecture is never a problem.
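For contrast, a minimal sketch of why boundaries are cheap to find with a fixed-width encoding. It assumes a hypothetical ISA where every instruction is exactly 4 bytes and there are no compressed forms; `WIDTH` and the function name are illustrative.

```python
WIDTH = 4  # bytes per instruction in this assumed fixed-width ISA


def fixed_width_starts(window_bytes, count):
    """Start offsets of up to `count` instructions that fit fully in the window."""
    return [k * WIDTH for k in range(count) if (k + 1) * WIDTH <= window_bytes]


print(fixed_width_starts(32, 8))
```

Every start offset is known before any byte of the window is inspected, so each of the 8 decoders can be wired to a fixed position.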

> No you don't do that, it's too power consuming.

Uh... yes you do? How else do you think it works? I'm not saying there's no opportunity for optimization (e.g. you only do this for main memory fetches and not uOp execution, pipeline it such that the full decode only happens a stage after length decisions, etc...), I'm saying that it isn't remotely an intractable power problem. Just draw it out: check the gates required for a 64->128 Dadda multiplier or 256 bit SIMD operation and compare with what you'd need here. It's noise.

And your citation of "8 instructions in parallel" seems suspicious. Did I just get trolled into an Apple vs. x86 flame war?

> Uh... yes you do? How else do you think it works?

No, I literally explain it in my first answer. The part about "1590 decoders" is irrelevant, since I misunderstood your message (I thought you were talking about using 16 decoders to decode the 16 instruction lengths of a single instruction).

But the rest on instruction length decode is how you actually do it.

> I'm saying that it isn't remotely an intractable power problem.

I mean, obviously, if you ignore all the power consumption issues of running 32 decoders in parallel and using only 5 of the 32 results, then yes, there's no problem.

But in reality, yes it's a problem to decode many x86 instructions in parallel.

> Just draw it out: check the gates required for a 64->128 Dadda multiplier or 256 bit SIMD operation and compare with what you'd need here. It's noise.

Yes, the energy consumption of multipliers is high, but I don't see how that is an argument for building an inefficient decoder. Also, a multiplier's power consumption depends on transistor activity, and you can expect the MSBs of the operands not to change too much. For a decoder, the transistor activity will be high.

> And your citation of "8 instructions in parallel" seems suspicious. Did I just get trolled into an Apple vs. x86 flame war?

Not a troll nor a flame war. I don't use Apple products, mainly because I don't agree with Apple practices. But actually choosing a RISC ISA allows them to decode a lot of instructions in parallel for little energy and complexity.

I chose 8 because it is the maximum that the mainstream will currently see. You might argue that 8 RISC instructions are not comparable with 8 CISC instructions, but even with, say, 4 CISC instructions it will still consume more energy.

> You might argue that 8 RISC instructions are not comparable with 8 CISC instructions, but even with say 4 CISC instructions it will still consume more energy

Alder Lake decodes six. And again, your intuition about power costs here is just simply wrong. Instruction decode is Simply Not a major part of the power budget of a modern x86 CPU. It's not.

> And again, your intuition about power costs here is just simply wrong. Instruction decode is Simply Not a major part of the power budget of a modern x86 CPU. It's not.

I never said that instruction decode was a major part of the power budget.

And it is not a major part precisely because they don't decode 32 instructions in parallel. That's the purpose of an instruction length decoder prior to instruction decode.