| HN Mirror



	by tails4e 2172 days ago
	Jim Keller had an interesting talk recently [1] about ways of doing parallel processing to make better use of the billions of transistors we have - assuming the task is parallelizable. There's the scalar core (i.e. the basic CPU), which is relatively easy to program. Then a scalar core with vector instructions, which is difficult to program efficiently. Then there are arrays of scalar cores, i.e. GPUs, which are relatively easy to program again. And now there are a lot of startups with arrays of scalar cores, each with vector engines, which are expected to be the most difficult to program. He didn't go into why vector instructions are hard to use efficiently, and hard for compiler writers, but I'd be interested if anyone here could explain that. [1] https://youtu.be/8eT1jaHmlx8

7 comments

confuseshrink 2172 days ago

Vectorization: I'm not an expert in this area, so I can only tell you what I've personally found difficult in dealing with vectorization. Usually it all comes down to alignment and vector lanes. To utilize the vector instructions you basically have to paint your memory into separate (but interleaved) regions that can be mapped to distinct vector lanes efficiently. Everything is fine as long as no two elements from separate lanes have to be mixed in some way; as soon as your computation requires that, you incur a heavy cost.
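A rough way to picture the cross-lane cost, modelled in plain Python rather than with real intrinsics. The 8-lane width and the function names are assumptions made up for illustration, and the "steps" counter is only a stand-in for shuffle-plus-add instruction pairs, not a measurement:

```python
LANES = 8  # assumed vector width, for illustration only


def vertical_add(a, b):
    """Lane i only touches a[i] and b[i]: conceptually one vector instruction."""
    return [x + y for x, y in zip(a, b)]


def horizontal_sum(v):
    """Sum across lanes by repeatedly folding the upper half onto the lower half.

    Each fold stands in for a shuffle plus an add, so a power-of-two width W
    needs log2(W) dependent steps instead of one.
    """
    steps = 0
    while len(v) > 1:
        half = len(v) // 2
        v = [v[i] + v[i + half] for i in range(half)]
        steps += 1
    return v[0], steps


a = list(range(1, LANES + 1))  # lanes hold 1..8
b = [10] * LANES

print(vertical_add(a, b))   # lanes never interact
print(horizontal_sum(a))    # (sum, number of cross-lane folds)
```

The lane-wise add stays a single step however wide the vector is, while the reduction has to move data between lanes at every fold.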

Dealing with these issues might require you to know the corners of the instruction set really well. Sometimes the solution lies outside the instruction set and comes down to how your data structure is laid out in memory, which leads you to AoS (array of structures) vs. SoA (structure of arrays) analysis, etc.
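For anyone who hasn't met AoS vs. SoA, here is a minimal sketch using only the standard `array` module. The particle data and the helper name `aos_to_soa` are hypothetical; the point is just where the values a vector load would want end up in memory:

```python
from array import array

FIELDS = 3  # x, y, z per particle (made-up example data)

# AoS: the three fields of one particle sit next to each other.
aos = array("f", [1.0, 2.0, 3.0,      # particle 0: x, y, z
                  4.0, 5.0, 6.0,      # particle 1
                  7.0, 8.0, 9.0,      # particle 2
                  10.0, 11.0, 12.0])  # particle 3


def aos_to_soa(buf, fields):
    """Split an interleaved buffer into one contiguous array per field."""
    return [buf[i::fields] for i in range(fields)]


xs, ys, zs = aos_to_soa(aos, FIELDS)

# In AoS, consecutive x values are FIELDS elements apart (a strided gather).
x_offsets_aos = list(range(0, len(aos), FIELDS))

# In SoA, the same x values are adjacent, so four of them would fill a
# 4-lane register with one contiguous load.
print(x_offsets_aos)
print(list(xs))
```

With the SoA layout a loop over all x values maps directly onto lanes; with AoS the same loop first has to de-interleave the data.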

Compilers and vectorization: Based on reading a lot of assembly output, I think what compilers usually struggle with are assumptions that the human programmer knows hold for a given piece of code, but that the compiler has no right to make. Some of this is basic alignment; gcc and clang have intrinsics for this. Sometimes it's related to the memory model of the programming language disallowing a load or a store at specific points.

GPGPU programmability: GPUs being easy to program is something I take with a grain of salt. Yes, it's easy to get up and running with CUDA. Making an _efficient_ CUDA program, however, is at least as challenging as writing an efficient AVX program, if not more so.

pornel 2172 days ago

Here's more on the problem of SIMD and C compilers:

https://pharr.org/matt/blog/2018/04/18/ispc-origins.html#aut...

> as long as vectorization can fail (and it will), […] you must come to deeply understand the auto-vectorizer. […] This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation

reitzensteinm 2172 days ago

GPUs aren't really arrays of scalar cores. All threads in a warp run in lock step. If one takes a branch they all do, with operations being masked off as needed.

It's not all that different conceptually to AVX-512 with mask registers, except the vector size is even larger and of course the programming model differs.
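A toy model of that lock-step behaviour, in plain Python. This is only a sketch of the idea, not how any particular GPU or AVX-512 implementation works; the function name and the example branch are made up. As far as I know, real hardware can skip a side of the branch when no lane takes it, which this model ignores:

```python
def warp_select(xs):
    """Model `y = x * 2 if x > 0 else -x` for lanes that run in lock step."""
    mask = [x > 0 for x in xs]        # per-lane predicate
    then_vals = [x * 2 for x in xs]   # every lane steps through the 'then' side
    else_vals = [-x for x in xs]      # ...and through the 'else' side
    # The mask decides which of the two results each lane keeps.
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]


print(warp_select([1, -2, 3, 0]))
```

The cost of a divergent branch in this model is the sum of both sides, which is why branch-heavy code tends to suit this style of hardware poorly.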

abainbridge 2172 days ago

> He didn't go into why vector instructions are hard to use efficiently, and hard for compiler writers, but I'd be interested if anyone here could explain that.

I have a simplistic explanation - maybe not what you're looking for but it is the best I can do...

At 12m23s in the video he says, "If you're working in a layer and the layers are well constructed (abstracted) you really can make a lot of progress. But if the top layer says, 'to make this really fast, go change the bottom layer', then it's going to get all tangled up."

That's what implementing an algorithm on a SIMD architecture feels like to me. I have to figure out a way of filling my SIMD width with data each clock cycle, while in contrast, the specification of the algorithm deals with data one piece at a time.

Take insertion sort as a (bad) example.

    i ← 1
    while i < length(A)
        j ← i
        while j > 0 and A[j-1] > A[j]
            swap A[j] and A[j-1]
            j ← j - 1
        end while
        i ← i + 1
    end while

That algorithm cannot easily take advantage of SIMD. You have to change the algorithm to make it work with the architecture.
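To make "change the algorithm" concrete, here is one possible substitute, odd-even transposition sort, sketched in plain Python. It is a different algorithm from the insertion sort above, picked only because every compare-exchange within a phase touches a disjoint pair, so a whole phase could in principle be done with vector min/max operations. The function name is my own:

```python
def odd_even_transposition_sort(values):
    """Sort by alternating compare-exchange phases over disjoint pairs."""
    a = list(values)
    n = len(a)
    for phase in range(n):           # n phases are enough for n elements
        start = phase % 2            # even phases pair (0,1),(2,3)...; odd phases (1,2),(3,4)...
        lo = a[start:n - 1:2]        # left element of each pair
        hi = a[start + 1:n:2]        # right element of each pair
        # These two lines are the data-parallel part: one min and one max per pair,
        # with no dependency between pairs.
        mins = [min(x, y) for x, y in zip(lo, hi)]
        maxs = [max(x, y) for x, y in zip(lo, hi)]
        a[start:n - 1:2] = mins
        a[start + 1:n:2] = maxs
    return a


print(odd_even_transposition_sort([5, 2, 9, 1, 5, 6]))
```

Note the trade: it always does on the order of n² compare-exchanges, whereas insertion sort can finish early on nearly sorted input. The algorithm was reshaped to fit the hardware, which is exactly the layering leak being described.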

We'd probably say the algorithm is the top level of the abstraction stack, and the SIMD architecture is a level near the bottom. So this problem is the opposite way around to how Jim phrased it, but the point is that we have NOT got clean abstraction - an implementation in one layer depends on the implementation in another.

zozbot234 2172 days ago

Are GPUs really easier to program than scalar w/ SIMD (or vector insns)? The programming models you have to work with for GPGPU seem quite obscure, whereas with CPU and SIMD flipping a compiler switch gets you most of the way there, and self-contained intrinsics do the rest.

TinkersW 2172 days ago

GPU programming is easy enough; the complexity comes from the separate memory system and the tedious (and not portable) API you need to use to access the GPU.

I prefer intrinsics as they give more control than shader languages and they can be written in C++ instead of fiddling with some garbage GPU API that runs async.

pjmlp 2172 days ago

MSL, CUDA and SYCL are C++ with extra topping.

Also one of the reasons CUDA won developer love is that it fully embraced polyglot programming on the GPU.

TinkersW 2172 days ago

None of those are both portable and widely available on end user machines, which is needed for games.

CUDA seems nice, but being Nvidia only makes it a total dead end.

slavik81 2171 days ago

Disclaimer: I work on AMD ROCm, but my opinions are my own.

There's also HIP[1], which can be used as a thin wrapper around CUDA, or with the ROCm backend on AMD platforms. It doesn't yet match CUDA in either breadth of features or maturity, but it's getting closer every day.

[1]: https://github.com/ROCm-Developer-Tools/HIP

gnufx 2171 days ago

As I understand it, that has to work for the CORAL 2 US "exascale" systems, so people who've been proved fairly right so far obviously have some confidence in it. (de Supinski of Livermore said he'd be out of a job if conventional wisdom was right, though it was pretty obvious at the time that it wasn't.) Free software too, praise be.

TinkersW 2170 days ago

It looks good but without Intel iGPU support I don't think any gamedevs would use it :/

I wish all the GPU companies would get together and make a standard based on C++ and stick with it.

kmbriedis 2172 days ago

I believe the ML community will strongly disagree. CUDA is everything

ethelward 2171 days ago

Because the academic ML community does not care about shipping products to end users who don't have Nvidia hardware.

pjmlp 2172 days ago

The Windows and iOS gaming communities will disagree with that statement.

Or are you speaking about the 1% Linux users on Steam?

gnufx 2171 days ago

At least part of the problem is that computing mostly depends on moving data. Memory bandwidth is relatively low, so it's difficult to get enough actual floating point intensity, at least for "large" arrays, even when it's theoretically available. A classic example is GEMM (general matrix multiplication), where you should expect a good implementation to get around 90% of peak performance, but also expect it to jump through various tricky hoops to get there. With, say, vector multiplication the hoops aren't available, and you're ultimately memory-bound. Yes, there's more to it than that, and SIMD has non-FP applications etc.
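A back-of-the-envelope version of that argument, as a Python sketch. The assumptions are mine: 8-byte elements, square n-by-n matrices, each operand moved to or from memory exactly once (i.e. perfect cache reuse for GEMM), and "vector multiplication" read as an element-wise product. Real kernels move more data than this ideal:

```python
def gemm_intensity(n, bytes_per_elem=8):
    """Ideal flops per byte for C = A @ B with n x n matrices."""
    flops = 2 * n ** 3                      # n^3 multiplies plus n^3 adds
    traffic = 3 * n * n * bytes_per_elem    # read A and B, write C, once each
    return flops / traffic


def elementwise_mul_intensity(n, bytes_per_elem=8):
    """Flops per byte for z[i] = x[i] * y[i] over n elements."""
    flops = n                               # one multiply per element
    traffic = 3 * n * bytes_per_elem        # read x and y, write z
    return flops / traffic


for n in (64, 1024):
    print(n, gemm_intensity(n), elementwise_mul_intensity(n))
```

GEMM's ratio grows linearly with n, so there is reuse to exploit if you block for the caches (the "tricky hoops"). The element-wise ratio is a constant 1/24 flop per byte no matter how large the vectors get, so no amount of cleverness in the kernel lifts it off the memory-bandwidth limit.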

amelius 2172 days ago

How does this solve the power problem that GP is talking about?

tails4e 2171 days ago

The power problem is solved by having cores more suited to a task. A CPU is completely general, but power inefficient. Dedicated HW is as efficient as it gets, but in the extreme is not flexible and only does one task well. With loads of extra silicon available, we can now use that for more specific engines/accelerators and of course not all of these would be active at once. So in a way the scaling / density does allow us to get more efficiency in some cases. The trick is finding the balance for a given process node.