by abainbridge 2172 days ago
What are the forces in chip design that are at play here? Over the last 10-15 years, fabs have continued to fit more and more logic gates per unit area, but haven't reduced the power consumption per gate as much. As a result, if you fill your modern chip with compute gates, you cannot use them all at once because the chip will melt. Or at least you can't have them all running at max clock rates. One solution is to increase the proportion of the chip used for SRAM (it uses less power per unit area than compute gates); this is what Graphcore have done. Another is to put down multiple different compute blocks, each designed for a different purpose, and only use a few of them at a time. The big.LITTLE Arm designs in smartphones are an example of that. But I feel like AVX512 might be an example too. When they add ML accelerator blocks next, those also will not be able to run flat out at the same time as the rest of the cores' resources.

I'm sure Intel should fix the problems Linus is complaining about, but I feel like chip vendors are being forced into this "add special purpose blocks" approach, as the only way to make their new chips better than their old ones.

6 comments

Jim Keller had an interesting talk recently [1] about ways of doing parallel processing to better use the billions of transistors we have - assuming the task is parallelizable. There's the scalar core (i.e. the basic CPU), which is relatively easy to program. Then a scalar core with vector instructions - difficult to program efficiently. Then there are arrays of scalar cores, i.e. GPUs, so relatively easy to program again, and now a lot of startups with arrays of scalar cores each with vector engines, so expected to be most difficult to program. He didn't go into why vector instructions are hard to use efficiently, and hard for compiler writers, but I'd be interested if anyone here could explain that.

1. https://youtu.be/8eT1jaHmlx8

Vectorization: I'm not an expert in this area so I can only tell you what I've personally found difficult in dealing with vectorization. Usually it all comes down to alignment and vector lanes. To utilize the vector instructions you basically have to paint your memory into separate (but interleaved) regions that can be mapped to distinct vector lanes efficiently. Everything is fine as long as no two elements from separate lanes have to be mixed in some way; as soon as your computation requires that, you incur a heavy cost.

Dealing with these issues might require you to know the corners of the instruction set really well. Sometimes the solution is outside of the instruction set and is related to how your data structure is laid out in memory, leading you to AoS vs SoA analysis etc.
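To make the AoS vs SoA point concrete, here is a minimal sketch in Python, with plain lists and `array` objects standing in for memory. The helper names `to_soa` and `scale_x` and the chunk width of 4 are made up for illustration; nothing here is real SIMD, it only shows why the SoA layout hands a vector unit contiguous data.

```python
from array import array

# Array of Structures: each element keeps its x, y, z together, so the
# x values of neighbouring elements are strided (every 3rd float in memory).
aos = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0), (10.0, 11.0, 12.0)]

def to_soa(points):
    """Transpose AoS into Structure of Arrays: one contiguous array per field."""
    xs = array("f", (p[0] for p in points))
    ys = array("f", (p[1] for p in points))
    zs = array("f", (p[2] for p in points))
    return xs, ys, zs

def scale_x(xs, factor, width=4):
    """Scale xs in chunks of `width`, the way a vector unit would consume them.

    In SoA form each chunk is a contiguous slice, i.e. a single vector load.
    `width` is a hypothetical lane count, not tied to any real ISA.
    """
    out = array("f")
    for start in range(0, len(xs), width):
        lane_values = xs[start:start + width]        # one "vector load"
        out.extend(v * factor for v in lane_values)  # one "vector multiply"
    return out

xs, ys, zs = to_soa(aos)
scaled = scale_x(xs, 2.0)
```

With the AoS layout the same operation would need a gather or shuffle to pull every third float into a register, which is the "mixing lanes" cost described above.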

Compilers and vectorization: Based on reading a lot of assembly output I think what compilers usually struggle with are assumptions that the human programmer knows hold for a given piece of code, but the compiler has no right to make. Some of this is basic alignment; gcc and clang have intrinsics for this. Sometimes it's related to the memory model of the programming language disallowing a load or a store at specific points.

GPGPU programmability: GPUs being easy to program is something I take with a grain of salt. Yes, it's easy to get up and running with CUDA. Making an _efficient_ CUDA program, however, is easily as challenging as writing an efficient AVX program, if not more so.

Here's more on the problem of SIMD and C compilers:

https://pharr.org/matt/blog/2018/04/18/ispc-origins.html#aut...

> as long as vectorization can fail (and it will), […] you must come to deeply understand the auto-vectorizer. […] This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation

GPUs aren't really arrays of scalar cores. All threads in a warp run in lock step. If one takes a branch they all do, with operations being masked off as needed.

It's not all that different conceptually to AVX-512 with mask registers, except the vector size is even larger and of course the programming model differs.
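A rough way to picture the lock-step-with-masking model is the Python sketch below. It is only an emulation under simplifying assumptions: the function name `masked_select` is invented for this example, lanes are list elements, and both sides of the branch are always evaluated (real hardware can skip a side when no lane needs it).

```python
def masked_select(values, predicate, then_fn, else_fn):
    """Emulate lock-step branching with a per-lane mask.

    Every lane evaluates both sides of the branch; the mask decides which
    result each lane keeps. No lane actually "jumps" on its own.
    """
    mask = [predicate(v) for v in values]        # per-lane condition
    then_result = [then_fn(v) for v in values]   # all lanes run the 'then' side
    else_result = [else_fn(v) for v in values]   # all lanes run the 'else' side
    return [t if m else e for m, t, e in zip(mask, then_result, else_result)]

# Scalar code being emulated: y = x * 2 if x % 2 == 0 else x + 100
lanes = [0, 1, 2, 3, 4, 5, 6, 7]
result = masked_select(lanes, lambda x: x % 2 == 0,
                       lambda x: x * 2, lambda x: x + 100)
```

The cost model follows directly: a divergent branch costs roughly the sum of both sides, whether the "lanes" are warp threads or elements of a masked vector register.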

> He didn't go into why vector instructions are hard to use efficiently, and hard for compiler writers, but I'd be interested if anyone here could explain that.

I have a simplistic explanation - maybe not what you're looking for but it is the best I can do...

At 12m23s in the video he says, "If you're working in a layer and the layers are well constructed (abstracted) you really can make a lot of progress. But if the top layer says, 'to make this really fast, go change the bottom layer', then it's going to get all tangled up."

That's what implementing an algorithm on a SIMD architecture feels like to me. I have to figure out a way of filling my SIMD width with data each clock cycle, while in contrast, the specification of the algorithm deals with data one piece at a time.

Take insertion sort as a (bad) example.

    i ← 1
    while i < length(A)
        j ← i
        while j > 0 and A[j-1] > A[j]
            swap A[j] and A[j-1]
            j ← j - 1
        end while
        i ← i + 1
    end while

That algorithm cannot easily take advantage of SIMD. You have to change the algorithm to make it work with the architecture.

We'd probably say the algorithm is the top level of the abstraction stack, and the SIMD architecture is a level near the bottom. So this problem is the opposite way around to how Jim phrased it, but the point is that we have NOT got clean abstraction - an implementation in one layer depends on the implementation in another.
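As an illustration of "changing the algorithm", here is a sketch of odd-even transposition sort in Python. This is a technique I am swapping in as an example, not something from the talk: it does more comparisons than insertion sort, but every phase is a fixed set of independent compare-exchange operations, which is the shape SIMD hardware likes. The function names are made up for this sketch.

```python
def compare_exchange(a, pairs):
    """Apply compare-exchange to disjoint index pairs.

    The pairs never overlap, so all the min/max operations are independent
    and could be issued as one vector min plus one vector max.
    """
    lo = [min(a[i], a[j]) for i, j in pairs]
    hi = [max(a[i], a[j]) for i, j in pairs]
    for (i, j), low, high in zip(pairs, lo, hi):
        a[i], a[j] = low, high

def odd_even_transposition_sort(values):
    """Sort with n phases of fixed, data-independent compare-exchanges."""
    a = list(values)
    n = len(a)
    for phase in range(n):
        start = phase % 2  # alternate between even and odd neighbour pairs
        pairs = [(i, i + 1) for i in range(start, n - 1, 2)]
        compare_exchange(a, pairs)
    return a

sorted_values = odd_even_transposition_sort([5, 2, 9, 1, 5, 6, 0, 3])
```

Note what changed: the inner `while` loop of insertion sort, whose trip count depends on the data, is gone. The control flow no longer depends on the values at all, at the price of always doing the worst-case amount of work.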

Are GPUs really easier to program than scalar w/ SIMD (or vector insns)? The programming models you have to work with for GPGPU seem quite obscure, whereas with CPU and SIMD flipping a compiler switch gets you most of the way there, and self-contained intrinsics do the rest.

GPU programming is easy enough; the complexity comes from the separate memory system and the tedious (and not portable) API you need to use to access the GPU.

I prefer intrinsics as they give more control than shader languages and they can be written in C++ instead of fiddling with some garbage GPU API that runs async.

MSL, CUDA and SYCL are C++ with extra topping.

Also one of the reasons CUDA won developer love is that it fully embraced polyglot programming on the GPU.

None of those are both portable and widely available on end user machines, which is needed for games

CUDA seems nice, but being Nvidia only makes it a total dead end.

Disclaimer: I work on AMD ROCm, but my opinions are my own.

There's also HIP[1], which can be used as a thin wrapper around CUDA, or with the ROCm backend on AMD platforms. It doesn't yet match CUDA in either breadth of features or maturity, but it's getting closer every day.

[1]: https://github.com/ROCm-Developer-Tools/HIP

I believe the ML community will strongly disagree. CUDA is everything.

The Windows and iOS gaming communities will disagree with that statement.

Or are you speaking about the 1% Linux users on Steam?

At least part of the problem is that computing mostly depends on moving data. Memory bandwidth is relatively low, so it's difficult to get enough actual floating point intensity, at least for "large" arrays, even when it's theoretically available. A classic example is GEMM (general matrix multiplication), where you should expect a good implementation to get around 90% of peak performance, but also expect it to jump through various tricky hoops to get there. With, say, vector multiplication the hoops aren't available, and you're ultimately memory-bound. Yes, there's more to it than that, and SIMD has non-FP applications etc.
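A back-of-the-envelope version of that argument, as a Python sketch. The assumptions are mine: 8-byte doubles, square n x n matrices, perfect reuse for GEMM (each operand moved once), and "vector multiplication" read as an elementwise product. The function names are hypothetical.

```python
def gemm_intensity(n, bytes_per_element=8):
    """Flops per byte for C = A @ B with n x n matrices, ideal data reuse.

    Assumes A and B are each read once and C is written once.
    """
    flops = 2 * n ** 3                           # n^3 multiplies + n^3 adds
    bytes_moved = 3 * n ** 2 * bytes_per_element
    return flops / bytes_moved

def elementwise_multiply_intensity(n, bytes_per_element=8):
    """Flops per byte for z[i] = x[i] * y[i] over length-n vectors."""
    flops = n                                    # one multiply per element
    bytes_moved = 3 * n * bytes_per_element      # read x, read y, write z
    return flops / bytes_moved

gemm = gemm_intensity(1024)                 # n / 12, grows with n
vec = elementwise_multiply_intensity(1024)  # 1 / 24, independent of n
```

GEMM's ratio grows with n, so blocking for cache (the "tricky hoops") can keep the FP units fed. The elementwise case is stuck at 1/24 flop per byte no matter how large the arrays get, so memory bandwidth sets the ceiling.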
How does this solve the power problem that GP is talking about?
The power problem is solved by having cores more suited to a task. A CPU is completely general, but power inefficient. Dedicated HW is as efficient as it gets, but in the extreme is not flexible and only does one task well. With loads of extra silicon available, we can now use that for more specific engines/accelerators and of course not all of these would be active at once. So in a way the scaling / density does allow us to get more efficiency in some cases. The trick is finding the balance for a given process node.
> Over the last 10-15 years, fabs have continued to fit more and more logic gates per unit area, but haven't reduced the power consumption per gate as much.

Kids these days get 8 cores for a 100W TDP.

When I was a boy, 100W got you a single core. And you didn't get dynamic frequency scaling, so it'd be putting out that heat all the time.

(We also had to walk to school barefoot in the snow, uphill both ways)

You must be young. Home PC CPUs from my youth drew only single digit watts. They didn't require any fan until the Pentium.
Indeed:

386, introduced 1985:

http://www.cpu-world.com/CPUs/80386/Intel-A80386-16.html

Typical/Maximum power dissipation: 1.85 Watt / 2.3 Watt

And no Pentium III (1999-2003) needed more than around 30 W:

https://en.wikipedia.org/wiki/List_of_Intel_Pentium_III_micr...

The Pentium II was not as efficient as the III. I remember setting up a dual socket machine where the PS started to matter. The best thing was that the web browser would only suck 100% from one processor.
>putting out that heat all the time

Even if the frequency was fixed, dissipated heat did definitely vary together with the computing load.

> so it'd be putting out that heat all the time

What's the problem? My old school pentiums kept my dorm room nice and toasty. Could keep my window cracked in the winter for fresh air while gentoo compiled...

Could you not use TDP to melt the snow ;)
The main problem is software: with GPGPUs you need to explicitly program for them, while with stuff like AVX there is this implicit hope that you just code as always and the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms.

Because outside artificial intelligence, graphics and audio, there is little else that common applications would use the GPGPU for, so the large majority of software developers keeps ignoring heterogeneous programming models.

> the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms.

With how AVX512 is implemented, there isn't much point in a compiler auto optimizing general purpose code to use it, because even if there is a theoretical speedup, it may well be slower in practice.

There might not be one, but all major C, C++, Hotspot, Graal and RyuJIT compilers do it to some extent.
> while with stuff like AVX there is this implicit hope that you just code as always and the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms

No. I recently could really, really have used the packed saturated integer arithmetic and horizontal addition in AVX2 (but my old machine doesn't support it) and, even better, the same but 512 bits wide on AVX512. It would only have been 6 or 7 instructions, if that, but it was in the inner loop, and it mattered. Using compiler intrinsics would have been fine. I think you're looking at things too narrowly.
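For anyone unfamiliar with the terms, this Python sketch shows what saturating arithmetic and a horizontal add mean for 16-bit lanes. It models the arithmetic only, not the exact semantics of any particular AVX2 or AVX-512 instruction, and the function names are invented for illustration.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def saturating_add_i16(a, b):
    """Lane-wise add of two int16 vectors, clamping instead of wrapping."""
    return [max(INT16_MIN, min(INT16_MAX, x + y)) for x, y in zip(a, b)]

def wrapping_add_i16(a, b):
    """Lane-wise add with two's-complement wraparound, for comparison."""
    return [((x + y + 32768) % 65536) - 32768 for x, y in zip(a, b)]

def horizontal_sum(lanes):
    """Reduce all lanes of one vector to a single scalar."""
    return sum(lanes)

a = [32000, -32000, 100, 7]
b = [1000, -1000, 200, -7]
saturated = saturating_add_i16(a, b)   # first two lanes clamp
wrapped = wrapping_add_i16(a, b)       # first two lanes flip sign
total = horizontal_sum(saturated)
```

In scalar code each clamp is a compare and a branch or select per element; the appeal of the packed instructions is that the clamp comes for free across every lane at once.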

I am looking at it from the point of view of a Joe/Jane developer who cannot tell head from tail regarding vector programming, doesn't even know what compiler intrinsics are for, and uses languages that don't expose them anyway.
Well those people will never be getting the most out of their CPUs to begin with.
Which is the whole point of "this implicit hope that you just code as always and the compiler will take care of the rest via auto-vectorization and PhD level optimization algorithms.", because not only do those people not get it, there is a general decline in using languages that expose vector intrinsics like C and C++ for regular LOB applications.
In my ideal world you'd be able to mark a function "this should compile to / run on gpgpu" and the compiler would potentially tell you why it can't do that. I'm not even sure anything is stopping us from implementing that apart from the effort required. Sure, many ways to write that code will result in terrible performance, but it would still be closer to the auto-vectorisation experience.

Actually we already have OpenMP to CUDA (http://www2.engr.arizona.edu/~ece569a/Readings/GPU_Papers/3....) so just making it more production-ready would be perfect.

The current OpenMP spec has GPU offload features specifically for what was expected of the Sierra supercomputer. I'm not sure how relevant a paper that old (relatively, I hasten to add) is.
> Because outside artificial intelligence, graphics and audio, there is little else that common applications would use the GPGPU for, so the large majority of software developers keeps ignoring heterogeneous programming models.

I think you got this backwards - the lack of developers' interest is what leads to the mistaken impression that GPU compute is only good for multimedia and FP-crunching workloads. Even looking at the success of GPU compute in mining cryptocoins (only ASICs do better) ought to be enough to tell you that we could do a lot more with them if we cared to.

From my point of view cryptomining is a useless fad, and typical line of business applications don't need anything more than what I listed.
>but haven't reduced the power consumption per gate as much.

That is simply not true. You can run the 64 Core on EPYC 2 all at once at 3Ghz all with Air Cooling.

At every node they have reduced power consumption; that is also one reason you see continuous performance improvement.

> That is simply not true.

I'm not claiming anything controversial. Power not having scaled as well as area recently is often referred to as the end of Dennard scaling:

https://en.wikipedia.org/wiki/Dennard_scaling#Breakdown_of_D...

> You can run the 64 Core on EPYC 2 all at once at 3Ghz all with Air Cooling.

That can be true despite the fact that power hasn't scaled as well as area.

> At every node they have reduced power consumption

Yep, just not as much as they improved area.
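To put rough numbers on "not as much as they improved area", here is a sketch using the textbook dynamic power model P = C * V^2 * f per transistor. Assumptions are mine and deliberately crude: leakage is ignored, and in the post-Dennard case both voltage and clock frequency are held flat. The function name is hypothetical.

```python
def scaled_power_density(k, voltage_scales=True):
    """Relative power density after shrinking linear dimensions by factor k.

    Uses dynamic power P = C * V^2 * f per transistor and ignores leakage.
    Returns power per unit area relative to the unscaled process (1.0).
    """
    capacitance = 1 / k                        # smaller devices, less capacitance
    voltage = 1 / k if voltage_scales else 1.0
    frequency = k if voltage_scales else 1.0   # assumption: clocks stall too
    power_per_transistor = capacitance * voltage ** 2 * frequency
    transistors_per_area = k ** 2
    return power_per_transistor * transistors_per_area

classic = scaled_power_density(2)                      # Dennard scaling holds
post = scaled_power_density(2, voltage_scales=False)   # voltage stuck
```

With k = 2, classic scaling gives 4x the transistors at the same power density. With voltage stuck, each transistor still uses half the power (so "every node reduces power" is true), but 4x as many of them fit in the same area, so power density doubles. Both statements in this subthread hold at once.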

> What are the forces in chip design that are at play here?

The "weak form" of Moore's Law--"Performance doubles every 12-18 months"--is dead and buried.

The "strong form" of Moore's Law is still active--"Transistor cost halves every 12-18 months".

This means that you can't make the primary paths any faster. So, all you can do is add functionality and pray that someone magically can make that functionality relevant to the primary use cases.

AVX is not a "special purpose block", it's Intel's answer to not adding special purpose blocks on customer demand, like you can do with ARM.

Crypto or video decoding comes to mind, those would be much faster with dedicated silicon, but more general AVX instructions can get you halfway there. Well, maybe a quarter. People point out that AVX uses a lot of power, but they ignore that the same algorithm running instead on more but simpler cores would use even more power.

> but more general AVX instructions can get you halfway there

Maybe I misunderstand you, but there are some fairly non-general ops for encoding/decoding crypto:

https://en.wikipedia.org/wiki/AVX-512#VAES

They exist today, but they were added after AVX. Every year we figure out how to cram more transistors on a cubic cm, and once the low hanging fruit was added and we knew how to add more transistors, we decided to start putting in more and more specific functions.

That is Linus's point. He would have preferred to use that increase in transistor count for other things, like more cache.

More cache has diminishing returns, because cache wants to be as close as possible to the core logic. And modern CPU's are mostly cache anyway. Special-purpose blocks for common compute tasks are quite cheap.
>And modern CPU's are mostly cache anyway.

Skylake is less than 30% cache. However, internally it has a 512-bit bus, thanks to AVX-512, which could be considered suboptimal.

Unsupported by valgrind still. Not sure about qemu. Don't use.