Hacker News
by BeeOnRope 2850 days ago
I think the claim that parallel compilation with gcc is memory bandwidth bound is unlikely to hold. gcc is known to be a very pointer-chasy, branch-mispredicty load that is highly sensitive to memory latency - far from a streaming load that is sensitive to raw bandwidth.

Still, the conclusion holds: if most of the time is spent waiting for values to come back from memory, a higher core frequency has strongly diminishing returns.
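
To put rough numbers on the diminishing returns, here is a minimal sketch of a two-part runtime model. It assumes (this is an illustration, not measured data) that runtime splits into a compute part that scales inversely with core frequency and a memory-stall part that does not scale at all. The function name and the 70% stall fraction are hypothetical.

```python
def speedup_from_frequency(stall_fraction: float, freq_ratio: float) -> float:
    """Estimated speedup when the core clock is multiplied by freq_ratio.

    stall_fraction: share of baseline runtime spent waiting on memory,
                    assumed here not to scale with core frequency.
    """
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be in [0, 1]")
    if freq_ratio <= 0.0:
        raise ValueError("freq_ratio must be positive")
    compute_fraction = 1.0 - stall_fraction
    # Only the compute part shrinks as the clock goes up.
    new_runtime = compute_fraction / freq_ratio + stall_fraction
    return 1.0 / new_runtime


if __name__ == "__main__":
    # Hypothetical workload: 70% of the time stalled on memory.
    for ratio in (1.1, 1.25, 1.5, 2.0):
        gain = speedup_from_frequency(0.7, ratio)
        print(f"{ratio:.2f}x clock -> {gain:.3f}x speedup")
```

Under these assumed numbers, even doubling the clock yields only about a 1.18x speedup, and no clock increase can ever get past 1 / 0.7.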

3 comments

That's only true if you compile a single file at a time, which is an exceedingly rare use case for a build server. As soon as you compile files in parallel, the CPU can simply switch to the next hardware thread during a load from main memory. Then there is the fact that dual-channel DDR4 just doesn't provide a lot of memory bandwidth in the first place. A 16-core/32-thread desktop CPU is probably not going to happen on the AM4/Ryzen platform, even if everything suddenly supports multi-threading on 16 cores, simply because the memory bandwidth isn't enough to translate into meaningful performance increases. GPUs have horrendous memory latencies, but they perform well precisely because they can just switch to the next thread and execute that one while waiting.
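
The thread-switching argument can be sketched with an idealized model. It assumes each hardware thread alternates a fixed burst of compute with a fixed memory stall, and that switching threads is free. Real cores are messier, and the 20 ns / 80 ns figures below are hypothetical, as are the function names.

```python
import math


def core_utilization(threads: int, compute_ns: float, stall_ns: float) -> float:
    """Idealized fraction of time the core does useful work.

    Assumes every thread repeats `compute_ns` of work followed by a
    `stall_ns` memory wait, and that switching threads costs nothing.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if compute_ns <= 0 or stall_ns < 0:
        raise ValueError("compute_ns must be > 0 and stall_ns >= 0")
    return min(1.0, threads * compute_ns / (compute_ns + stall_ns))


def threads_to_hide_latency(compute_ns: float, stall_ns: float) -> int:
    """Smallest thread count at which the idealized core never idles."""
    if compute_ns <= 0 or stall_ns < 0:
        raise ValueError("compute_ns must be > 0 and stall_ns >= 0")
    return math.ceil((compute_ns + stall_ns) / compute_ns)


if __name__ == "__main__":
    # Hypothetical: 20 ns of work between 80 ns memory stalls.
    for n in (1, 2, 4, 8):
        print(n, "threads ->", core_utilization(n, 20.0, 80.0))
    print("threads needed:", threads_to_hide_latency(20.0, 80.0))
```

How much switching helps in this model depends entirely on the ratio of stall time to compute time, which is the figure one would need to measure for a real compile.
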
Well, you are mixing the effects of "more cores" and SMT together here. Sure, SMT helps hide some latency effects, but it doesn't significantly increase the demand for bandwidth. The increase in bandwidth demand from introducing SMT is probably approximated by the increase in performance: a 30% uplift from running two hardware threads per core means the bandwidth requirement increases by about 30%.

That's not enough to turn gcc from a largely latency bound load to a memory bandwidth hog!

Ryzen only has two threads per core, so one would be able to see at most a 2x gain. That's not insignificant, but still far from what one needs to start seeing bandwidth problems.
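
As a back-of-the-envelope version of that estimate, the sketch below scales a baseline bandwidth demand by the SMT throughput uplift and compares it with the theoretical peak of dual-channel DDR4. The 10 GB/s baseline demand and the DDR4-2666 speed grade are assumptions chosen for illustration, not measurements, and the function names are made up.

```python
def peak_dram_bandwidth_gbs(channels: int, mega_transfers_per_s: float,
                            bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (10^9 bytes per second).

    A DDR4 channel is 64 bits wide, hence 8 bytes per transfer.
    """
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


def demand_with_smt(base_demand_gbs: float, smt_uplift: float) -> float:
    """Bandwidth demand with SMT on, assuming demand tracks throughput."""
    return base_demand_gbs * (1.0 + smt_uplift)


if __name__ == "__main__":
    peak = peak_dram_bandwidth_gbs(channels=2, mega_transfers_per_s=2666)
    base = 10.0  # hypothetical demand without SMT, in GB/s
    typical = demand_with_smt(base, 0.30)  # the 30% uplift case
    ceiling = demand_with_smt(base, 1.00)  # the at-most-2x case
    print(f"peak {peak:.1f} GB/s, typical {typical:.1f}, ceiling {ceiling:.1f}")
```

Whether the demand gets anywhere near the peak depends on the real baseline, which would have to be read from hardware counters rather than assumed.
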
Closer to zero returns, and you still get to improve latency hiding and memory controller design (including bit width and block sizes).

Even more cache ways won't help too much in this workload.

Unless the AMD design is unusual, the return is not that close to zero: a significant part of the "path to memory" involves things that run at the core clock, in particular everything from the core to the L2 and probably some part of the coordination logic which communicates with the "uncore". I'm not sure about AMD chips, but on some chips there is a relationship between the uncore speed and the core speed: e.g., the uncore speed might often be the same as the maximum core speed for any core on the socket.

Adding to that, there are other effects that allow core frequency to leak into the performance of memory-bound programs, such as a higher frequency allowing the core to run ahead more quickly to get more memory requests in flight, recover more quickly after a branch misprediction, etc. Try it sometime: find something which is really memory bound and crank the frequency way down. There will probably be a significant effect, though not nearly in proportion to the frequency difference.
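
A rough way to model that expectation: split the memory-stall time into a share that runs at the core clock (core to L2, coordination logic) and a share that does not (DRAM itself). The sketch below does this. The 90% stall fraction and 30% clock-scaled share are guesses made purely for illustration, and the function name is hypothetical.

```python
def relative_runtime(stall_fraction: float, clock_scaled_share: float,
                     freq_ratio: float) -> float:
    """Runtime relative to baseline (1.0) after scaling the core clock.

    stall_fraction:     share of baseline runtime spent waiting on memory.
    clock_scaled_share: share of that waiting time which is assumed to
                        scale with the core clock.
    freq_ratio:         new clock divided by old clock (0.5 = half speed).
    """
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be in [0, 1]")
    if not 0.0 <= clock_scaled_share <= 1.0:
        raise ValueError("clock_scaled_share must be in [0, 1]")
    if freq_ratio <= 0.0:
        raise ValueError("freq_ratio must be positive")
    compute = (1.0 - stall_fraction) / freq_ratio
    stall = stall_fraction * (clock_scaled_share / freq_ratio
                              + (1.0 - clock_scaled_share))
    return compute + stall


if __name__ == "__main__":
    # Hypothetical memory-bound program run at half the clock.
    print(f"{relative_runtime(0.9, 0.3, 0.5):.2f}x baseline runtime")
```

With these assumed inputs, halving the clock costs roughly 1.37x in runtime: a clear slowdown, but well short of the 2x a fully clock-bound program would see.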

I wonder if his results still hold for gcc -O3?

That would be much more CPU-bound. Yes, some optimizations will do global traversals.

I wonder if javac/clang have the same characteristics as gcc.

Yeah maybe. I haven't found a huge difference between -O3 and -O2, but maybe I haven't been running big enough compiles.