by evoke4908 558 days ago
How is it any different from ordinary parallel compilation? Make will happily use dozens of CPU cores for compilation, even if linking and other operations must be synchronous.

Even if a GPU core is slower than a CPU core, you have vastly more of them. If you have a project with more TUs than you have CPU cores, I don't see how it couldn't be faster.

Hell, you can even trivially compile on different remote machines with distcc. If that's faster, how could a GPU be worse?

4 comments

Make isn't a compiler. It's a task graph engine, and it can run targets in parallel if they don't depend on each other. Many of those targets will be calls to a compiler that won't themselves be parallel. If you're building 8 shared object library files and 6 executables that share code but don't otherwise depend on each other, then running tests after the build, on top of preprocessor macro expansion and whatnot, there's a lot that can be run in parallel even if no call to the compiler itself can be.
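
To make the task graph point concrete, here is a minimal Python sketch (not how Make is actually implemented). The target names and dependencies are made up for illustration. It groups targets into "waves" where everything in one wave could run at the same time, while each individual job stays serial:

```python
from graphlib import TopologicalSorter

# Hypothetical build graph (assumption for illustration): target -> prerequisites.
deps = {
    "a.o": set(),
    "b.o": set(),
    "main.o": set(),
    "libfoo.so": {"a.o", "b.o"},
    "app": {"main.o", "libfoo.so"},
    "test": {"app"},
}

def parallel_waves(graph):
    """Group targets into waves; all targets in one wave are independent."""
    ts = TopologicalSorter(graph)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # everything whose prerequisites are done
        waves.append(ready)
        ts.done(*ready)                 # pretend the whole wave finished
    return waves

waves = parallel_waves(deps)
for i, wave in enumerate(waves):
    print(i, wave)
```

The three object files land in the same wave, but the link, the executable, and the test run each have to wait their turn.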

GNU does have a separate project to parallelize gcc: https://gcc.gnu.org/wiki/ParallelGcc. The wiki page covers limitations, challenges, and benchmarks. It can speed things up a little, but the benefits seem to vanish after 4 threads.

I think the funny thing about this discussion is that everyone agrees with the sentiment of "9 women can't make a baby in one month", but somehow the intuition is forgotten when it comes to compilers, parsers, etc. Like, if human engineers with human brains can't parallelize program construction, what hope do computers have?

Ultimately this, like many things, boils down to P=?NP (search/optimization is NP hard).

You actually do not have vastly more cores. To a first approximation, a CPU core is equivalent to a streaming multiprocessor (in NVIDIA parlance) on the GPU. A 14900K has 24 cores (of two kinds) and a similarly big 3060 has 28 SMs. The GPU effectively trades all the deep pipelining and branch prediction for much wider SIMD. That makes it massively slower for any code that involves branching and massively faster for any code that is data parallel.
Probably more accurate to multiply that by 4. Each SM is split into 4 partitions that can each execute different instructions but with shared L1 cache.
GPU style SIMD parallelization cannot take separate if/else branches.

If/else in GPU land is implemented by having the GPU execute the if() side with (EXEC-mask), THEN the else() side with (not EXEC-mask).

Ie: the exec mask gives the appearance of skipping over the unnecessary code. But in practice, each of the 32 CUDA threads sits on one branch or the other, and thus the system must physically execute both sides while throwing away the results from the masked-off lanes.
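
A rough way to picture the masking is this toy Python simulation of one warp. It is a sketch, not real GPU behavior in detail: `warp_if_else` and its arguments are hypothetical names, and it assumes a side is skipped only when no lane wants it.

```python
def warp_if_else(values, cond, then_fn, else_fn):
    """Simulate one warp running if/else under an execution mask.

    Returns (per-lane results, number of branch sides the warp executed).
    """
    mask = [cond(v) for v in values]
    out = list(values)
    passes = 0
    if any(mask):                  # then-side runs if at least one lane wants it
        passes += 1
        for lane, v in enumerate(values):
            r = then_fn(v)         # every lane steps through the instructions
            if mask[lane]:
                out[lane] = r      # only masked-on lanes keep the result
    if not all(mask):              # else-side runs if at least one lane wants it
        passes += 1
        for lane, v in enumerate(values):
            r = else_fn(v)
            if not mask[lane]:
                out[lane] = r
    return out, passes

lanes = list(range(32))
out, passes = warp_if_else(lanes, lambda v: v % 2 == 0,
                           lambda v: v * 2, lambda v: v + 100)
uniform_out, uniform_passes = warp_if_else(lanes, lambda v: True,
                                           lambda v: v * 2, lambda v: v + 100)
print(passes, uniform_passes)
```

With lanes split between the two sides the warp pays for both; when every lane agrees, it pays for one.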

---------

CPU parallelization in contrast is a true skip of the unnecessary if/else side. It takes a branch predictor to do it well though.

This also means that in a GPU, if one (of the 32 CUDA threads aka lanes) needs to loop 10,000 times, then ALL the CUDA lanes loop 10,000 times (with the other lanes possibly throwing away 9,999+ iterations of work as waste heat).
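
The arithmetic for that loop case, as a small Python sketch. `warp_loop` is a hypothetical helper, and the model assumes the warp keeps issuing until its slowest lane is done:

```python
def warp_loop(trip_counts):
    """Return (iterations issued by the warp, useful lane-iterations, wasted lane-iterations)."""
    issued = max(trip_counts)              # the warp loops until the slowest lane finishes
    useful = sum(trip_counts)              # work the lanes actually needed
    total_slots = issued * len(trip_counts)
    return issued, useful, total_slots - useful

# One lane needs 10,000 iterations, the other 31 need just one each.
counts = [10_000] + [1] * 31
issued, useful, wasted = warp_loop(counts)
print(issued, useful, wasted)
```

Each of the 31 short lanes sits masked off for 9,999 iterations, which is where the wasted count comes from.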

> How is it any different from ordinary parallel compilation? Make will happily use dozens of CPU cores for compilation, even if linking and other operations must be synchronous

It's not any different because ordinary compilation isn't parallelizable either. Chopping up your program into TUs is a workaround for the fact that it isn't, not proof that it is. Think about it: why does ODR exist? Why does LTO exist? Also think about it: is linking parallelizable?

Linking is very parallelizable: see the mold linker for a demonstration.
I feel no one in these comments knows the difference between weak scaling and strong scaling.
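
For anyone who wants the distinction spelled out: strong scaling asks how much faster a fixed-size job gets with more workers (Amdahl's law), while weak scaling asks how much more work gets done in the same time when the job grows with the worker count (Gustafson's law). A minimal Python sketch, where the 90% parallel fraction is an assumed number and not a measurement of any compiler:

```python
def amdahl_speedup(p, n):
    """Strong scaling: fixed problem size, parallel fraction p, n workers."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    """Weak scaling: problem size grows with n, parallel fraction p."""
    return (1.0 - p) + p * n

p = 0.9  # assumed parallel fraction, purely illustrative
for n in (4, 28, 1000):
    print(n, round(amdahl_speedup(p, n), 2), round(gustafson_speedup(p, n), 2))
```

Under this assumption the strong-scaling speedup can never reach 10x no matter how many cores you add, while the weak-scaling number keeps growing. Building one fixed project faster is a strong-scaling question; "more TUs than cores" is closer to a weak-scaling argument.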