|
|
|
|
|
by waltpad
2164 days ago
|
|
Disclaimer: I am not a HW designer, I could very well be wrong. It is true that there are tasks where threading matters, but still require a CPU rather than a GPU. I wonder however if these tasks do need full SSE/AVX etc. Couldn't these extensions be removed of the CPU cores and instead have the necessary work performed by the GPU? It would be interesting to produce statistics on how much these extensions are used in these scenario. Imagine how much space and complexity could be saved on a CPU die by making stripped down versions. That space could in turn be used for more cores! I read a little about the Xeon PHI cpus, which iirc, is a multicore CPU with a very small ISA, but I wonder why x86 makers aren't trying to go in that direction: isn't there plenty of dedicated workloads which would happily run on these (eg, web servers), or is this just a (too) simplistic view? |
|
SSE/AVX shares an L1 cache that's damn near instantaneous to access for the CPU core. Total L1 bandwidth is on the scale of TB/s.
PCIe -> GPU takes 1-microsecond to 10-microseconds per access, and operates only at 50GB/s (or 1/20th the speed of L1 bandwidths).
------------
Case in point: Memset is very commonly AVX'd to clear out L1 cache and initialize ~1kb to 32kb of data to 0 as quickly as possible.
There's no way for "memset" to move from CPU to GPU unless you feel like obliterating the entire point of L1, L2, and L3 cache. If you moved a "memset" to GPU, it'd operate only at 15GB/s (the speed of PCIe 3.0 x16 lanes), far, far slower than L1 cache AVX-loads/stores.
SIMD units, like SSE and AVX, are highly "local" and have huge advantages.