Hacker News
by titzer 2023 days ago
Lots of other comments point out the vertical integration.

For raw single-thread performance:

1. ARM64 is a fixed-width instruction set, so their frontend can decode more instructions in parallel.

2. They've got one honking monster of an out-of-order execution engine (630 entries), which feeds:

3. 16 execution ports.


I don't fully grasp assembly, instruction sets, and how CPUs work so pardon the silly questions.

I think I understand 1) as: since they know the width, they can more accurately divide the instructions among more parallel executors (whatever they are - the execution ports?)

2) I believe this allows more "pre-work" to get done before it's actually needed, but then the "pre-work" just chills until

3) these things do the work, and there are an abnormally high number of them?

p.s. Any noob friendly reading is also appreciated!

For 1), just think of instructions as little bundles of bytes. The CPU runs through the instructions in forward order, jumping around to other bits of the code as it goes. X86 has variable-width instructions (i.e. from 1 byte up to 15 bytes; X86 is complex and there are a lot of prefix bytes that have been used to add new functionality over the years). To determine how long an instruction is, you need to decode the bits of the instruction, so you don't know where the next instruction starts until you've at least partly decoded the current one. For ARM64, and most other ISAs nowadays, the instructions are all 4 bytes long. That means every instruction's start is known up front, so they can all be decoded in parallel.
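
To make the boundary problem concrete, here is a rough Python sketch. The variable-width encoding is made up for illustration (the low 4 bits of the first byte give the instruction length); it is not how real x86 encodes anything, and the function names are hypothetical too.

```python
def fixed_width_starts(code, width=4):
    """Start offset of every instruction when all instructions are `width` bytes."""
    # No byte of `code` is inspected: the boundaries follow from the length alone,
    # so each slot could be handed to a separate decoder at the same time.
    return list(range(0, len(code), width))


def variable_width_starts(code):
    """Start offsets for a made-up variable-width encoding (NOT real x86).

    Assumption for illustration: the low 4 bits of an instruction's first
    byte give its total length in bytes (1 to 15).
    """
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        length = code[pc] & 0x0F  # must look at this instruction first...
        if length == 0:
            raise ValueError(f"invalid length at offset {pc}")
        pc += length              # ...before the next start is known
    return starts


fixed = bytes(16)  # four 4-byte instructions
variable = bytes([0x01, 0x03, 0xAA, 0xBB, 0x02, 0xCC, 0x01])

print(fixed_width_starts(fixed))        # [0, 4, 8, 12]
print(variable_width_starts(variable))  # [0, 1, 4, 6]
```

The fixed-width version never reads the bytes at all, while the variable-width loop carries `pc` from one iteration to the next. That loop-carried dependency is the thing that makes wide parallel decode harder.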

For 2, imagine a boa constrictor swallowing a huge piece of prey. One mouth (CPU: the frontend) and one rear (CPU: the retirement phase). The instructions go in the front end in program order. They are decoded into operations that pile up in the middle (the giant bulge in the boa constrictor). When an instruction is ready to go, one of the execution ports (3--think of 16 little stomachs) picks up an instruction and executes it. Then at the backend, the retirement phase, instructions are committed in the order they appeared in the original program, so that the program computes the same result.

By making basically all of the pieces of this boa constrictor bigger and more numerous, it eats a lot more instructions per clock (on average). Making that bulge (the reorder buffer) huge gives the CPU a high chance of finding some useful work to feed to one of its 16 stomachs.
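
If it helps, here is a toy Python model of the boa constrictor. It is only a sketch under heavy assumptions: instructions are `(latency, deps)` pairs, every port can run every operation, and there are no branches, caches, or register renaming. The `simulate` function and the little test program are hypothetical, not a model of any real core.

```python
def simulate(program, window_size, ports):
    """Cycles needed to retire `program` in this toy out-of-order model.

    program: list of (latency, deps); deps are indices of EARLIER instructions.
    window_size: how many un-retired instructions may be in flight (the "bulge").
    ports: how many operations may start per cycle (the "stomachs").
    """
    n = len(program)
    done_at = [None] * n  # cycle at which each result becomes available
    next_fetch = 0        # next instruction to enter the window (program order)
    next_retire = 0       # oldest instruction not yet retired
    cycle = 0
    while next_retire < n:
        # Retire strictly in program order: stop at the first unfinished one.
        while (next_retire < n and done_at[next_retire] is not None
               and done_at[next_retire] <= cycle):
            next_retire += 1
        # The front end fills the window in program order while there is room.
        while next_fetch < n and next_fetch - next_retire < window_size:
            next_fetch += 1
        # Issue up to `ports` operations whose inputs are ready, oldest first.
        issued = 0
        for i in range(next_retire, next_fetch):
            if issued == ports:
                break
            if done_at[i] is not None:
                continue  # already issued
            latency, deps = program[i]
            if all(done_at[d] is not None and done_at[d] <= cycle for d in deps):
                done_at[i] = cycle + latency
                issued += 1
        cycle += 1
    return cycle


# Four blocks: a slow 10-cycle "load", one op that needs it, six independent ops.
program = []
for _ in range(4):
    load = len(program)
    program.append((10, []))      # slow operation
    program.append((1, [load]))   # depends on the slow operation
    program.extend([(1, [])] * 6)  # independent filler work

small = simulate(program, window_size=2, ports=4)
large = simulate(program, window_size=32, ports=4)
print(small, large)
```

With the tiny window, the core stalls behind each slow operation because nothing younger is allowed in. With the large window, the four slow operations overlap and the independent work runs while they are in flight, so `large` comes out well below `small` even though the number of ports is the same.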

I think it's easy to underestimate how much difference (1) makes. Take the famous line "thequickbrownfoxjumpsoverthelazydog" - and think how you'd parse that out programmatically. You'd start at the start, reading each character in, comparing it against a dictionary, and when you decide you have a whole word - then you can split that word out - and then continue on to the next.

But you can't really do this in parallel as the start for each word depends on the previous split already being known.

If it was simply law that every word in existence was 5 characters, you could parse this out with zero lookups, zero knowledge. "accurately" isn't so much the issue, it's that you have to decode each instruction to know where the next starts.
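
A quick Python sketch of the same analogy, assuming a tiny hand-picked dictionary and a greedy longest-match strategy (both are just choices for illustration):

```python
WORDS = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}


def split_with_dictionary(text, words):
    """Greedy longest-match split; each word's start depends on the previous split."""
    longest = max(len(w) for w in words)
    out = []
    pos = 0
    while pos < len(text):
        for size in range(min(longest, len(text) - pos), 0, -1):
            if text[pos:pos + size] in words:
                out.append(text[pos:pos + size])
                pos += size  # only now do we know where the next word starts
                break
        else:
            raise ValueError(f"no dictionary word at position {pos}")
    return out


def split_fixed(text, width):
    """If every word is `width` characters, no lookups are needed at all."""
    return [text[i:i + width] for i in range(0, len(text), width)]


print(split_with_dictionary("thequickbrownfoxjumpsoverthelazydog", WORDS))
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print(split_fixed("helloworld", 5))
# ['hello', 'world']
```

Every slice in `split_fixed` can be computed independently, whereas `split_with_dictionary` has to finish one word before it can even begin the next.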

Yup, you've got the basic ideas. Hennessy and Patterson's books are the standard rec. "Computer Organization and Design" is the version more targeted at developers, and "Computer Architecture: A Quantitative Approach" is more focused on CEs (computer engineers) or people who will be getting more into the guts.