| The difficulty is not to decode a single instruction, the difficulty is to decode multiple instructions in parallel (let's say from 5 to 8 instructions in parallel). In a modern high performance processor instructions are decoded in batches:
Decoding the first instruction is straightforward.
But x86 instructions range from 1 to 15 bytes, therefore the second instruction can start from byte-offset 1 up to 15.
3rd instruction has a byte-offset ranging from 2 to 30, ans so on.
Furthermore, figuring out an x86 instruction length requires reading several byte from the instruction. In the end, the 8th instruction has 99 possible byte-offset,
and assuming that we put, as you suggest, a decoder for each position and length, we need about 1590 decoders and many multiplexer to decode 8 full instructions per cycle. Of course we don't do that, it would consume a lot of energy for nothing. To handle that, modern x86 processor instruction decoding involves a instruction length decode before the instruction decode.
The instruction length decode is responsible for identifying the instruction positions and boundaries, and this instruction length decode is a challenging part of the x86 processor to design. We don't know how Intel or AMD exactly do instruction length decode, but we know that some published techniques include a length predictor. That's why, for simplicity and energy efficiency, instruction boundaries must be easily identified and the number of instruction lengths must be kept low. |
Um... wat? No CPU tries to decode 99 bytes of memory in a cycle. ADL is at 32 currently, I believe. And the instruction starting at byte 12 doesn't change depending on anything but it's own data. It either exists (because the previous instruction ended on byte 11) or it doesn't. So you decode 32 instructions starting at each byte you've fetched (the last ones can be smaller subset engines because they don't need to decode longer instruction forms), and then mask them on or off based on earlier instruction state. Then feed your 1-32 decoded instructions through a mux tree to pack them and you're done.
Surely there's more complexity, since this is going to have to be pipelined in practice, and a depth of 32 is going to require something akin to a carry-lookahead adder instead of being chained.
But the combinatorics you're citing seem ridiculous, I don't understand that at all.