

by waltpad 2211 days ago

Disclaimer: I am not a HW designer, I could very well be wrong.

It is true that there are tasks where threading matters but which still require a CPU rather than a GPU. I wonder, however, whether these tasks really need full SSE/AVX etc. Couldn't these extensions be removed from the CPU cores, with the necessary work performed by the GPU instead?

It would be interesting to produce statistics on how much these extensions are used in these scenarios. Imagine how much space and complexity could be saved on a CPU die by making stripped-down versions. That space could in turn be used for more cores!

I read a little about the Xeon PHI CPUs, which iirc are multicore CPUs with a very small ISA, but I wonder why x86 makers aren't trying to go in that direction: aren't there plenty of dedicated workloads which would happily run on these (eg, web servers), or is this just a (too) simplistic view?

3 comments

dragontamer 2211 days ago

> It is true that there are tasks where threading matters but which still require a CPU rather than a GPU. I wonder, however, whether these tasks really need full SSE/AVX etc. Couldn't these extensions be removed from the CPU cores, with the necessary work performed by the GPU instead?

SSE/AVX shares an L1 cache that's damn near instantaneous to access for the CPU core. Total L1 bandwidth is on the scale of TB/s.

PCIe -> GPU takes 1 to 10 microseconds per access, and operates at only 50GB/s (or about 1/20th of L1 bandwidth).

------------

Case in point: memset is very commonly AVX'd to clear out L1 cache and initialize ~1kB to 32kB of data to 0 as quickly as possible.

There's no way for "memset" to move from the CPU to the GPU unless you feel like obliterating the entire point of the L1, L2, and L3 caches. If you moved a "memset" to the GPU, it'd operate at only 15GB/s (the speed of a PCIe 3.0 x16 link), far, far slower than AVX loads/stores to L1 cache.
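As a rough illustration of the numbers above, here is a back-of-envelope sketch (not a benchmark). The constants are assumptions lifted from the ballpark figures in this comment: 1 TB/s stands in for "on the scale of TB/s", 15GB/s is the PCIe 3.0 x16 figure, and 1 microsecond is the optimistic end of the latency range. The function names are made up for the example.

```python
# Back-of-envelope estimate of clearing a small buffer locally vs. over PCIe.
# All constants are assumed ballpark values, not measurements.
L1_BANDWIDTH = 1e12     # bytes/s, assumed stand-in for "on the scale of TB/s"
PCIE_BANDWIDTH = 15e9   # bytes/s, the PCIe 3.0 x16 figure quoted above
PCIE_LATENCY = 1e-6     # seconds, optimistic end of the 1-10 microsecond range


def clear_time_l1(nbytes):
    """Time to write nbytes at the assumed L1 bandwidth."""
    return nbytes / L1_BANDWIDTH


def clear_time_pcie(nbytes, latency=PCIE_LATENCY):
    """One round of PCIe latency plus the transfer time for nbytes."""
    return latency + nbytes / PCIE_BANDWIDTH


for size in (1024, 32 * 1024):
    t_l1 = clear_time_l1(size)
    t_pcie = clear_time_pcie(size)
    print(f"{size:>6} bytes: L1 ~{t_l1 * 1e9:.1f} ns, "
          f"PCIe ~{t_pcie * 1e9:.1f} ns, ratio ~{t_pcie / t_l1:.0f}x")
```

Under these assumptions the PCIe path comes out on the order of 100x to 1000x slower for buffers in the 1kB to 32kB range, and the fixed latency dominates for the smallest sizes.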

SIMD units, like SSE and AVX, are highly "local" and have huge advantages.


TinkersW 2210 days ago

I think the opposite is where things need to go. Having a wide SIMD ALU quickly accessible from your CPU core is very useful, especially as it shares the same memory system and offers a much more flexible programming model that allows you to do everything in a single source.


waltpad 2210 days ago

The GPU programming model is not very flexible at the lowest level: one has to create all the software infrastructure to communicate with the GPU (which boils down to sending commands and receiving responses). There are languages (like Futhark, Julia, or even Python) which handle all that boilerplate transparently.

The main problem is, afaik, that there is not enough control over where the code will run in these languages. At some point, one will want to describe all the algorithms using a single language, and somehow describe how the workload will have to be distributed across all the processors, or at least that's what I've been thinking about for a while. Once you have that level of control, the need for a versatile CPU is less clear. Note that nowadays people seem happy with hybrid solutions where the code is scattered across several languages (eg, one for the main program and one for the shaders, or for the client side UI), so my position is maybe not very strong.

HW-wise, is it possible that integrated GPUs are the first step toward an architecture where the CPU and GPU have better interconnections (ie, larger communication bandwidth and smaller latency), to the point where SIMD becomes moot? There is also the SWAR (SIMD within a register) approach, where one doesn't rely on intrinsic SIMD instructions but instead emulates them with ordinary integer operations (though it's probably not very realistic for floating point computation).
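To make the SWAR idea concrete, here is a minimal sketch of one classic trick: adding eight unsigned bytes packed into a single 64-bit word using only plain integer operations, with masks keeping carries from leaking between byte lanes. The names `swar_add_u8`, `pack`, and `unpack` are hypothetical helpers for this example, and Python is used only for readability; in practice this would be written in C or similar on a native 64-bit register.

```python
# SWAR sketch: lane-wise addition (mod 256) of eight bytes packed in 64 bits.
HIGH_BITS = 0x8080808080808080  # top bit of each byte lane
LOW_BITS = 0x7F7F7F7F7F7F7F7F   # low 7 bits of each byte lane


def swar_add_u8(x, y):
    """Add eight packed bytes lane by lane, wrapping each lane mod 256."""
    # Adding only the low 7 bits of each lane cannot carry into the next lane.
    low_sum = (x & LOW_BITS) + (y & LOW_BITS)
    # The top bit of each lane is restored with XOR, which discards the carry out.
    return low_sum ^ ((x ^ y) & HIGH_BITS)


def pack(values):
    """Pack eight ints in 0..255 into one 64-bit integer (little-endian)."""
    return int.from_bytes(bytes(values), "little")


def unpack(word):
    """Split a 64-bit integer back into its eight byte lanes."""
    return list(word.to_bytes(8, "little"))


a = pack([1, 2, 3, 4, 250, 255, 128, 127])
b = pack([1, 1, 1, 1, 10, 1, 128, 1])
print(unpack(swar_add_u8(a, b)))
```

The same masking pattern extends to other lane widths, but it only covers simple integer operations, which is part of why it is a poor fit for floating point.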

Some other ideas:

- Apple has this neural engine in their latest chips, which is basically dedicated HW for neural networks

- In the wild, people are getting more and more interested in building their custom ASICs to cut software's middle-man cost: for them, the CPU solution is not good enough

- Intel recently introduced a new matrix ops extension in their CPUs: maybe at some point they'll introduce full GPU capabilities baked directly into the CPU? I am a little worried about the resulting ISA.

Anyway, I am not an HW engineer, nor a very good software one. I only have a limited view of the difficulties in writing good, CPU- or GPU-efficient code. My first post was prompted by remembering the first "large scale" multicore CPUs of 15 years ago (specifically the UltraSPARC T1), which weren't SIMD heavy. The direction naturally shifted as progress was made on SIMD to try to compete with GPUs, whereas it seems to me that originally CPUs and GPUs were complementary.

I tend to support modular solutions, but I don't know how costly that would be in terms of efficiency at the HW level.


waltpad 2208 days ago

Ooops, I didn't mean Xeon PHI, I meant an older design with many small x86 cores.

Xeon PHI, on the other hand, was the first host of the AVX-512 instruction set. Sorry.
