Hacker News new | ask | show | jobs
by dzaima 676 days ago
Expanding on 3: I think it ends up at O(n^2 * log n) transistors, O(log n) critical path (not sure on routing or what fan-out issues might there be).

Basically: determine end of instruction at each byte (trivial but expensive). Determine end of two instructions at each byte via end2[i]=end[end[i]]. Then end4[i]=end2[end2[i]], etc, log times.

That's essentially log(n) shuffles. With 32-byte/cycle decode that's roughy five 'vpermb ymm's, which is rather expensive (though various forms of shortcuts should exist - for the larger layers direct chasing is probably feasible, and for the smaller ones some special-casing of single-byte instructions could work).

And, actually, given the mention of O(log n)-transistor shuffles at http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo..., it might even just be O(n * log^2(n)) transistors.

Importantly, x86 itself plays no part in the non-trivial part. It applies equqlly to the RISC-V compressed extension, just with a smaller constant.

1 comments

Determining the end of a RISC-V instruction requires checking two bits and you have the knowledge that no instruction exceeds 4 bytes or uses less than 2 bytes.

x86 requires checking for a REX, REX2, VEX, EVEX, etc prefix. Then you must check for either 1 or 2 instruction bytes. Then you must check for the existence of a register byte, how many immediate byte(s), and if you use a scaled index byte. Then if a register byte exists, you must check it for any displacement bytes to get your final instruction length total.

RISC-V starts with a small complexity then multiplies it by a small amount. x86 starts with a high complexity then multiplies it by a big amount. The real world difference here is large.

As I pointed out elsewhere ARM's A715 dropped support for aarch32 (which is still far easier to decode than x86) and cut decoder size by 75% while increasing raw decoder count by 20%. The decoder penalties of bad ISA design extend beyond finding instruction boundaries.

I don't disagree that the real-world difference is massive; that much is pretty clear. I'm just pointing out that, as far as I can tell, it's all just a question of a constant factor, it's just massive. I've written half of a basic x86 decoder in regular imperative code, handling just the baseline general-purpose legacy encoding instructions (determines length correctly, and determines opcode & operand values to some extent), and that was already much.