Hacker News new | ask | show | jobs
by wolf550e 2850 days ago
Not every workload is memory bandwidth bound like his "make -j16" compile. Some workloads need memory latency or fast inter-core (and inter-socket) operations (e.g. RDBMS OLTP), some need CPU throughput (e.g. HPC), some need best possible single thread CPU performance (e.g. some gaming).

As he wrote, CPUs are most efficient (compute per Watt) at a specific frequency, and if his CPU mostly waits for RAM, this can be done at low power.

It's probably possible to create x86-64 CPUs with narrower backends (fewer execution units) with microcode-emulated 128 and 256 bit registers/operations (and maybe even emulated FPU) and get a cheaper and faster build server, if it was economical to fab such narrow-use-case chips (those would be good for redis/memcached too I imagine).

3 comments

> Not every workload is memory bandwidth bound like his "make -j16" compile.

He actually did `make -j32`, not 16. Which is going to absolutely devastate the cache.

`make -j<number of cores x 2>` was a good rule of thumb back when you had 1/2/4 physical CPUs with their own sockets on a motherboard and spinning rust hard disks. A lot of "compilation" time was reading the source code off the disk. But it doesn't make any sense anymore with so many cores, hyperthreading, and SSDs that serve you the file in milliseconds.

If he's bandwidth limited, he would gain a significant performance improvement by reducing the number of processes.

I'd reserve judgement until I saw measurements. Maybe all 32 jobs are using the same cpp/gcc/asm/ld binaries (or whatever the stack is these days) which never get evicted. And I presume this is running under DragonFly BSD whose design goals include aggressive SMP support, so things might be different there. I don't know.
Zen is already light on vector units, and microcodes 256-bit operations.

It's certainly possible to build a more lightweight core, but most of that work is reducing the complexity of the out-of-order machinery. The FPU+ALU is under a quarter of each Zen core. https://en.wikichip.org/w/images/c/cb/amd_zen_core_%28annota...

It is definitely not "microcoded" - 256-bit operations are just sent in halves to the 128-bit ALUs and combined for the final answer.

Don't get mixed up between "microcoding" and "micro-op" - the latter is something different, slower and which usually requires some kind of transition in the decoders and uop caches to start reading microcoded ops. The latter is the "normal" or "fast" mode for the CPU and just because one instruction turns into two uops (or macro-ops or whatever AMD calls them) doens't mean microcoded.

Sorry, I meant "the former is something different..."
You do want it to out-of-order and branch predict and speculate enough to issue speculative RAM reads as soon as a possibly needed address is available, to hide RAM latency (as long as rollbacks of speculatively executed operations hide the loaded values in the cache so the speculation leaves no side effects), that is important for performance.

In that picture (thanks!) I see the FPU is big, the decoder is big, the branch predictor is big, the rest is probably needed. Maybe emulated FPU is good for some workloads, maybe ability to program in microinstructions instead of x86-64 is useful too. But maybe silicon area is not the expensive thing (dark silicon, etc.).

Do you have a link about microcoded avx256? I would think it would be way too slow.
> 256-bit vector instructions (AVX instructions) are split into two micro-ops handling 128 bits each.

https://www.agner.org/optimize/blog/read.php?i=838#838

This is responsible for Ryzen losing to Intel in SIMD heavy benchmarks. The upside is that it avoids the reduced turbo boost Intel does for some 256-bit AVX instructions (and even worse downclocks caused by AVX 512), so for workloads mixing avx and normal instructions it shouldn't do too badly.

I see, yep - but this is still hardwired stuff happening in instruction decode, as Agner writes, not trapping to microcode sequencing.
Is there a good resource that explains the difference between those ("decode" vs. "trapping") on a modern CPU? When I see "trap" I imagine the kernel catching illegal instruction exceptions and emulating them in software, but it doesn't seem like that's what you mean?
Modern CPUs execute micro ops, RISC-like instructions (e.g. load from memory address to register, add two registers, store from register to memory address). The CPU's "decode" stage translates x86 instructions into micro-ops (often 1-to-1, but x86 compare followed by x86 jump are translated into a single micro-op, while some x86 instructions are translated into multiple micro-ops).

On one CPU model a x86 operation like "256bit add" might translate into "256bit add" micro-op, and on another model the same x86 operation might be translated into a series of micro-ops like "128bit add, wait a cycle for the 1st add to finish, pass the carry bit into a 2nd 128 bit add", because that model doesn't have a real 256bit adder. So the latency of the operation is 2 cycles, but nothing else is changed.

Some x86 instructions might be very complicated and cannot be translated into a fixed-length series of micro-ops using a template. For example, the integer division, square root or the string compare machine instructions might be loops with conditionals in them and don't run the same amount of micro-ops every time. They can be implemented by Intel using a program written in micro-ops. Intel stores this program in flash on the CPU and the decoder knows to run that program when encountering the instruction. The OS doesn't need to help here, this is not emulation or software-floating point, it's just that the single instruction takes 200 clock cycles. What this does to the out-of-order engine is another story. These "programs", called microcode, can have bugs and newer versions of microcode updates, sent to the CPU at boot by the BIOS/UEFI and/or by the OS, update them.

https://en.wikichip.org/wiki/macro-operation

https://en.wikichip.org/wiki/micro-operation

https://en.wikipedia.org/wiki/Microcode

I'm interested in performance optimization (especially under linux) and its intersection with computer architecture. Would you mind recommending me any resources to get started there?
For x86 specifically, Agner Fog's manual, the Intel and AMD optimization manuals, the SO x86 tag wiki [1].

[1] https://stackoverflow.com/tags/x86/info

That's a great idea, thank you!