| HN Mirror

by evoke4908 558 days ago

How is it any different from ordinary parallel compilation? Make will happily use dozens of CPU cores for compilation, even if linking and other operations must be synchronous.

Even if a GPU core is slower than a CPU core, you have vastly more of them. If you have a project with more TUs than you have CPU cores, I don't see how it couldn't be faster.

Hell, you can even trivially compile on different remote machines with distcc. If that's faster, how could a GPU be worse?

4 comments

nonameiguess 558 days ago

Make isn't a compiler. It's a task graph engine, and it can run targets in parallel if they don't depend on each other. Many of those targets will be calls to a compiler that won't themselves be parallel. If you're building 8 shared object library files and 6 executables, which share code but don't otherwise depend on each other, then run tests after the build, in addition to preprocessor macro expansions and whatnot, there's a lot that can be run in parallel even if no calls to the compiler itself can be.
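To make the "task graph engine" point concrete, here is a minimal sketch in Python using the standard library's `graphlib`. The target names and the dependency graph are hypothetical, and real make starts each job as soon as its prerequisites finish rather than in strict waves, so treat this as an illustration of the idea, not of make's actual scheduler.

```python
from graphlib import TopologicalSorter

def parallel_waves(deps):
    """Group targets into waves; targets in the same wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Every target whose prerequisites are all finished is ready now.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

# Hypothetical build graph: target -> list of prerequisites.
deps = {
    "common.o": [],
    "liba.so": ["common.o"],
    "libb.so": ["common.o"],
    "app1": ["common.o"],
    "app2": ["common.o"],
    "test": ["liba.so", "libb.so", "app1", "app2"],
}

for wave in parallel_waves(deps):
    print(wave)
```

Each compiler invocation inside a wave is still a single serial process; the parallelism comes only from running independent targets side by side.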

GNU does have a separate project to parallelize gcc: https://gcc.gnu.org/wiki/ParallelGcc. The Wiki has limitations, challenges, and benchmarks. It can speed things up a tiny bit, but barely, and benefits seem to vanish after 4 threads.

almostgotcaught 558 days ago

I think the funny thing about this discussion is everyone agrees with the sentiment of "9 women can't make a baby in one month" but somehow the intuition is forgotten when it comes to compilers, parsers, etc. Like if human engineers with human brains can't parallelize program construction, what hope do computers have?

Ultimately this, like many things, boils down to P=?NP (search/optimization is NP hard).

incrudible 558 days ago

You actually do not have vastly more cores. To a first approximation, a CPU core is equivalent to a streaming multiprocessor (SM, in NVIDIA parlance) on the GPU. A 14900K has 24 cores (of two kinds) and a similarly big 3060 has 28 SMs. The GPU effectively trades all the deep pipelining and branch prediction for much wider SIMD. That makes it massively slower for any code that involves branching and massively faster for any code that is data parallel.

kcb 558 days ago

Probably more accurate to multiply that by 4. Each SM is split into 4 partitions that can each execute different instructions but with shared L1 cache.
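As a rough back-of-the-envelope check of the numbers in this subthread, a small Python sketch. The 24 cores, 28 SMs, and 4 partitions come from the comments above; 32 lanes per partition is an assumption based on the 32-thread warp width mentioned elsewhere in the thread.

```python
CPU_CORES = 24            # 14900K, from the parent comment
SMS = 28                  # 3060, from the parent comment
PARTITIONS_PER_SM = 4     # from this comment
LANES_PER_PARTITION = 32  # assumption: one 32-wide warp per partition

def gpu_counts(sms=SMS, partitions=PARTITIONS_PER_SM, lanes=LANES_PER_PARTITION):
    """Return (independent instruction streams, total SIMD lanes)."""
    streams = sms * partitions
    return streams, streams * lanes

streams, lanes = gpu_counts()
print(f"independent instruction streams: {streams} (vs {CPU_CORES} CPU cores)")
print(f"SIMD lanes: {lanes}")
```

Under these assumptions the "vastly more cores" figure only appears when you count SIMD lanes; counted by independent instruction streams, the gap to the CPU is single-digit.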

dragontamer 558 days ago

GPU style SIMD parallelization cannot take separate if/else branches.

If/else in GPU land is implemented by having the GPU execute the if() side with (EXEC-mask), THEN the else() side with (not EXEC-mask).

I.e., the exec mask gives the appearance of skipping over the unnecessary code. But in practice, each of the 32 CUDA threads takes one branch or the other, and thus the system must physically execute both sides while throwing away the results.
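A minimal Python simulation of the exec-mask behavior described above. This is a sketch of the idea, not of any real GPU's ISA: the function name `simd_if_else` is made up for illustration, and masked-off lanes are modeled as idle slots that still cost a pass.

```python
def simd_if_else(values, cond, then_fn, else_fn):
    """Simulate one warp running `if cond(x): then_fn(x) else: else_fn(x)`.

    Returns (per-lane results, number of passes the whole warp had to issue).
    """
    exec_mask = [cond(v) for v in values]
    out = list(values)
    passes = 0
    # Pass 1: the if() side is issued for the whole warp if any lane needs it.
    if any(exec_mask):
        passes += 1
        for lane, v in enumerate(values):
            if exec_mask[lane]:
                out[lane] = then_fn(v)
            # Lanes with the mask bit clear sit idle for this pass.
    # Pass 2: the else() side, issued with the inverted mask.
    if not all(exec_mask):
        passes += 1
        for lane, v in enumerate(values):
            if not exec_mask[lane]:
                out[lane] = else_fn(v)
    return out, passes

warp = list(range(32))
diverged, cost = simd_if_else(warp, lambda x: x % 2 == 0, lambda x: x * 2, lambda x: -x)
uniform, cost_uniform = simd_if_else(warp, lambda x: True, lambda x: x * 2, lambda x: -x)
print(cost, cost_uniform)
```

When every lane agrees on the condition, only one side is issued; as soon as a single lane disagrees, the warp pays for both.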

---------

CPU parallelization in contrast is a true skip of the unnecessary if/else side. It takes a branch predictor to do it well though.

This also means that in a GPU, if one (of the 32 CUDA threads aka lanes) needs to loop 10,000 times, then ALL the CUDA lanes loop 10,000 times (with the other lanes possibly throwing away 9,999+ iterations of work as waste heat).
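The loop case can be sketched the same way. This is a hypothetical cost model in Python (the function `warp_loop_cost` is made up for illustration) that assumes the warp keeps issuing iterations until its slowest lane is done.

```python
def warp_loop_cost(trip_counts):
    """Return (iterations issued by the warp, useful lane-slots, wasted lane-slots)."""
    issued = max(trip_counts)                # the warp loops until the slowest lane finishes
    total_slots = issued * len(trip_counts)  # every lane occupies a slot on every iteration
    useful = sum(trip_counts)
    return issued, useful, total_slots - useful

# One lane needs 10,000 iterations, the other 31 need just one.
issued, useful, wasted = warp_loop_cost([10_000] + [1] * 31)
print(issued, useful, wasted)
```

In this example each of the 31 short lanes idles through 9,999 iterations, which is the "9,999+ iterations of waste" from the comment.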

almostgotcaught 558 days ago

> How is it any different from ordinary parallel compilation? Make will happily use dozens of CPU cores for compilation, even if linking and other operations must be synchronous

It's not any different because ordinary compilation isn't parallelizable either. Chopping up your program into TUs is a workaround for the fact that it isn't, not proof that it is. Think about it: why does ODR exist? Why does LTO exist? Also think about it: is linking parallelizable?

nuudlman 558 days ago

Linking is very parallelizable: see the mold linker for a demonstration.

almostgotcaught 558 days ago

I feel no one in these comments knows the difference between weak scaling and strong scaling.
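For readers who want the distinction spelled out: strong scaling asks how much faster a fixed-size problem gets with more workers (usually modeled with Amdahl's law), while weak scaling asks how much more work gets done in the same time when the problem grows with the worker count (usually modeled with Gustafson's law). A minimal Python sketch of both textbook formulas follows; the 90% parallel fraction is an arbitrary illustrative value, not a measurement of any compiler or linker.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Strong scaling: fixed problem size, the serial part bounds the speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

def gustafson_speedup(parallel_fraction, workers):
    """Weak scaling: the parallel part of the problem grows with the worker count."""
    return (1.0 - parallel_fraction) + parallel_fraction * workers

p = 0.9  # illustrative assumption: 90% of the work parallelizes
for n in (4, 16, 64, 1024):
    print(n, round(amdahl_speedup(p, n), 2), round(gustafson_speedup(p, n), 2))
```

Building more TUs on more cores is the weak-scaling situation; making one fixed build, or one TU, finish faster is the strong-scaling one, and with p = 0.9 it tops out below 10x no matter how many workers are added.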
