Hacker News new | ask | show | jobs
by redraga 2023 days ago
I can't comment on the economics of it but I can comment on the technical difficulties. The issue for x86 cores is keeping the ROB fed with instructions - no point in building a huge OoO if you can't keep it fed with instructions.

Keeping the ROB full falls on the engineering of the front-end, and here is where CISC v RISC plays a role. The variable length of x86 has implications beyond decode. The BTB design becomes simpler with a RISC ISA since a branch can only lie in certain chunks in a fetched instruction cache line in a RISC design (not so in CISC). RISC also makes other aspects of BPU design simpler - but I digress. Bottom line, Intel and AMD might not have a large ROB due to inherent differences in the front-end which prevent larger size ROBs from being fed with instructions.

(Note that CISC definitely does have it's advantages - especially in large code foot-print server workloads where the dense packing of instructions help - but it might be hindered in typical desktop workloads)

Source: I've worked in front-end CPU micro-architecture research for ~5 years

1 comments

How do you feel about RISC-V compact instructions? The resulting code seems to be 10-15% smaller than x86 in practice (25-30% smaller than aarch64) while not requiring the weirdness and mode-switching associated with thumb or MIPS16e.

Has there actually been much research into increasing instruction density without significantly complicating decode?

Given the move toward wide decoders, has there been any work on the idea of using fixed-size instruction blocks and huffman encoding?

I can't really comment on the tradeoffs between specific ISAs since I've mainly worked on micro-arch research (which is ISA agnostic for most of the pipeline).

As for the questions on research into looking at decode complexity v instruction density tradeoff - I'm not aware of any recent work but you've got me excited to go dig up some papers now. I suspect any work done would be fairly old - back in the days when ISA research was active. Similar to compiler front-end work (think lex, yacc, grammar etc..) ISA research is not an active area currently. But maybe it's time to revisit it?

Also, I'm not sure if Huffman encoding is applicable to a fixed-size ISA. Wouldn't it be applicable only in a variable size ISA where you devote smaller size encoding to more frequent instructions?

Fixed instruction block was referring to the Huffman encoding. Something like 8 or 16kb per instruction block (perhaps set by a flag?). Compilers would have to optimize to stay within the block, but they optimize for sticking in L1 cache anyway.

Since we're going all-in on code density, let's go with a semi-accumulator 16-bit ISA. 8 bits for instructions, 8 bits for registers (with 32 total registers). We'll split into 5 bits and 3 bits. 5-bits gives access to all registers since quite a few are either read-only (zero register, flag register) or write occasionally (stack pointer, instruction counter). The remaining 3 bits specify 8 registers that can be the write target. There will be slightly more moves, but that just means that moves compress better and seems like it should enforce certain register patterns being used more frequently which is also better for compression.

We can take advantage of having 2 separate domains (one for each byte) to create 2 separate Huffman trees. In the worst case, it seems like we increase our code size, but in more typical cases where we're using just a few instructions a lot and using a few registers a lot, the output size should be smaller. While our worst-case lookup would be 8 deep, more domain-specific lookup would probably be more likely to keep the depth lower. In addition, two trees means we can process each instruction in parallel.

As a final optimization tradeoff, I believe you could do a modified Huffman that always encoded a fixed number of bits (eg, 2, 4, 6, or 8) which would half theoretical decode time at the expense of an extra bit on some encodings. it would be +25% for 3-bit encoding, but only 16% for 5-bit encoding (perhaps step 2, 3, 4, 6, 8). For even wider decode, we could trade off a little more by forcing the compiler to ensure that each Huffman encoding breaks evenly every N bytes so we can setup multiple encoders in advance. This would probably add quite a bit to compiling time, but would be a huge performance and scaling advantage.

Immediates are where things get a little strange. The biggest problem is that the immediate value is basically random so it messes up encoding, but otherwise it messes with data fetching. The best solution seems to be replacing the 5-bit register address with either 5 bits of data or 6 bits (one implied) of jump immediate.

Never gave it too much thought before now, but it's an interesting exercise.