CPU-SIMD is less about competing against GPUs and more about latency.
GPUs will always have more GFlops and memory bandwidth at a lower cost. They're specifically built GFlop and memory-bandwidth machines. Case in point: the NVidia 2070 Super is 8 TFlops of compute at $400, a tiny fraction of what this POWER10 will cost.
If POWER10 costs anything like POWER9, we're looking at well over $2000 for the bigger chips and $1000 for reasonable multisocket motherboards. And holy moly: 602mm^2 at 7nm is going to be EXPENSIVE. EDIT: I'm only calculating ~2 TFlops from the hypothetical 60x SMT4 POWER10 at 4GHz. That's nowhere close to GPU-level Flops.
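The comment doesn't say how the ~2 TFlops figure was derived. One way to land in that range is the usual peak formula (cores x clock x FLOPs per cycle per core). The sketch below assumes 8 FLOPs per cycle per core, which is my guess and not a number from the thread; the 60 cores and 4GHz are the comment's own hypotheticals.

```python
def peak_tflops(cores: int, clock_ghz: float, flops_per_cycle_per_core: int) -> float:
    """Theoretical peak in TFlops: cores * clock (GHz) * FLOPs/cycle/core / 1000."""
    return cores * clock_ghz * flops_per_cycle_per_core / 1000.0

# ASSUMPTION (not from the thread): each core retires 8 FLOPs per cycle.
ASSUMED_FLOPS_PER_CYCLE = 8

# 60 cores and 4 GHz are the hypothetical figures quoted in the comment.
power10_estimate = peak_tflops(cores=60, clock_ghz=4.0,
                               flops_per_cycle_per_core=ASSUMED_FLOPS_PER_CYCLE)

# The 8 TFlops GPU figure is the 2070 Super number quoted in the comment.
gpu_tflops = 8.0
gpu_advantage = gpu_tflops / power10_estimate

print(f"CPU estimate: {power10_estimate:.2f} TFlops")
print(f"GPU advantage: {gpu_advantage:.1f}x")
```

Under that assumption the CPU comes out a bit under 2 TFlops, roughly a quarter of the quoted GPU figure. Change `ASSUMED_FLOPS_PER_CYCLE` if you have a better per-core number.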
However, the CPU-GPU link is slow in comparison to CPU-L1 cache (or even CPU-DDR4 / DDR5). A CPU can "win the race" by using SIMD onboard, completing your task before the ~5 microseconds needed to communicate with the GPU have even elapsed.
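To put a rough number on "winning the race", here is a toy model using only figures from this comment: ~2 TFlops for the CPU, 8 TFlops for the GPU, and ~5 microseconds to reach the GPU. It assumes both sides run at peak and ignores data transfer time, so treat the result as an order-of-magnitude sketch rather than a measurement.

```python
CPU_FLOPS_PER_SEC = 2e12    # ~2 TFlops, the comment's CPU estimate
GPU_FLOPS_PER_SEC = 8e12    # 8 TFlops, the comment's GPU figure
GPU_LATENCY_SEC = 5e-6      # ~5 microseconds to communicate with the GPU

def cpu_time(flops: float) -> float:
    """Seconds for the CPU to finish the task (assumes peak throughput)."""
    return flops / CPU_FLOPS_PER_SEC

def gpu_time(flops: float) -> float:
    """Seconds for the GPU path: fixed latency plus compute at peak throughput."""
    return GPU_LATENCY_SEC + flops / GPU_FLOPS_PER_SEC

# Task size where both paths take the same time:
#   flops / cpu = latency + flops / gpu
break_even_flops = GPU_LATENCY_SEC / (1 / CPU_FLOPS_PER_SEC - 1 / GPU_FLOPS_PER_SEC)

print(f"Break-even task size: {break_even_flops / 1e6:.1f} MFLOPs")
```

In this simplified model, anything smaller than the break-even size finishes on the CPU first; larger tasks amortize the latency and the GPU pulls ahead.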
----------
With that being said: POWER10 also implements PCIe 5.0, which means it will be one of the fastest processors for communicating with future GPUs.
It's more appropriate to compare pricing of Tesla with a datacenter-grade CPU like POWER10 (or Epyc/Xeon/etc.).
A64FX (in Fugaku, the current #1 machine on all popular supercomputing benchmarks) has shown that CPUs can compete with top-shelf GPUs on bandwidth and floating point energy efficiency.
Fugaku has 158,976 nodes x2 chips each, or 317,952 A64FX chips.
Summit has 4,608 nodes x 6 GPUs each, or 27,648 V100 GPUs. It also was built back in 2018.
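A quick check of the counts above, using only the node and per-node figures quoted in this comment:

```python
# Figures as quoted in the comment.
fugaku_nodes, chips_per_fugaku_node = 158_976, 2
summit_nodes, gpus_per_summit_node = 4_608, 6

fugaku_chips = fugaku_nodes * chips_per_fugaku_node
summit_gpus = summit_nodes * gpus_per_summit_node

# How many A64FX chips Fugaku has for every V100 in Summit.
chips_per_gpu = fugaku_chips / summit_gpus

print(f"Fugaku: {fugaku_chips:,} A64FX chips")
print(f"Summit: {summit_gpus:,} V100 GPUs")
print(f"Ratio:  {chips_per_gpu:.1f} A64FX chips per V100")
```

So the machine-level comparison is between roughly an order of magnitude more CPU chips and a much smaller number of GPUs.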
---------
While Fugaku is certainly an interesting design, it seems inevitable that a modern GPU (say A100 Amperes) would crush it in FLOPs. Really, Fugaku's most interesting point is its high rate of HPCG, showing that its interconnect is hugely efficient.
Per-node, Fugaku is weaker. They built an amazing interconnect to compensate for that weakness. Fugaku is also an HBM-based computer, meaning you cannot easily add or remove RAM (whereas a CPU / GPU system can be configured with more or less RAM by adding or removing sticks).
These are the little differences that matter in practice. A64FX is certainly an accomplishment, but I wouldn't go so far as to say it's proven that CPUs can keep up with GPUs in terms of raw FLOPs.
A100 has a 20% edge on energy efficiency for HPL, along with higher intrinsic latencies. It's also 6-12 months behind A64FX in deployment. https://www.top500.org/lists/green500/2020/06/
HPCG mostly tests memory bandwidth rather than interconnect, but Fugaku does have a great network.
Adding DRAM to a GPU-heavy machine has limited benefit due to the relatively low bandwidth to the device. They're effectively both HBM machines if you need the ~TB bandwidth per device (or per socket).
Normalizing per node (versus per energy or cost) isn't particularly useful unless your software doesn't work well with distributed memory.
> Adding DRAM to a GPU-heavy machine has limited benefit due to the relatively low bandwidth to the device. They're effectively both HBM machines if you need the ~TB bandwidth per device (or per socket).
This POWER10 chip under discussion has 1TB bandwidth to devices with expandable RAM.
Yeah, I didn't think it was possible. But... congrats to IBM for getting this done. Within the context of this hypothetical POWER10, 1TB bandwidth interconnects to expandable RAM is on the table.
It's 410 GB/s peak for DDR5. The "up to 800 GB/s sustained" figure is for GDDR6, and POWER10 isn't slated to ship until Q4 2021, so it isn't really a direct comparison with hardware that was deployed in 2019.
Your point about latency is the reason why we can't use GPUs for realtime audio processing, even though they would otherwise be absolutely well-suited. Stuff like computing FFTs and convolution could be done much faster on GPUs, but the latency would be prohibitive in realtime audio.
GPU latency is ~5 microseconds per kernel, plus the time it takes to transfer data into and out of the GPU (~15GBps on PCIe 3.0 x16). Given that audio is probably less than 1MBps, PCIe bandwidth won't be an issue at all.
Any audio system based on USB controls would be on the order of 1000 microseconds of latency (1,000,000 microseconds / 1000 USB updates per second == 1000 microseconds). Let's assume that we have a hard realtime cutoff of 1000 microseconds.
While GPU latency is an issue for tiny compute tasks, I don't think it's actually big enough to make a huge difference for audio applications (which usually use standard USB controllers at 1ms specified latency).
I mean, you only have room for 200 CPU-GPU transfers (5-microseconds per CPU-GPU message), but if the entire audio-calculation was completed inside the GPU, you wouldn't need any more than 1-message there and 1-message back.
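The budget arithmetic above can be sketched as follows. All inputs are the figures from this comment (1000 USB updates per second, ~5 microseconds per CPU-GPU message, ~15GBps PCIe, ~1MBps audio); treating the 5 microseconds as a per-message cost in each direction is my reading of the comment, not something it states precisely.

```python
USB_UPDATES_PER_SECOND = 1000      # standard USB polling rate cited above
MESSAGE_LATENCY_US = 5.0           # ~5 microseconds per CPU-GPU message
PCIE_BYTES_PER_SECOND = 15e9       # ~15GBps on PCIe 3.0 x16
AUDIO_BYTES_PER_SECOND = 1e6       # "probably less than 1MBps", taken as an upper bound

# Hard realtime budget per USB update, in microseconds.
budget_us = 1_000_000 / USB_UPDATES_PER_SECOND

# How many CPU-GPU messages would fit if the budget held nothing else.
max_messages = budget_us / MESSAGE_LATENCY_US

# Audio data produced during one budget period, and its PCIe transfer time.
bytes_per_period = AUDIO_BYTES_PER_SECOND / USB_UPDATES_PER_SECOND
transfer_us = bytes_per_period / PCIE_BYTES_PER_SECOND * 1e6

# One message there and one message back, each carrying one period of audio.
overhead_us = 2 * MESSAGE_LATENCY_US + 2 * transfer_us
remaining_us = budget_us - overhead_us

print(f"Budget: {budget_us:.0f} us, room for {max_messages:.0f} messages")
print(f"PCIe transfer per period: {transfer_us:.3f} us")
print(f"Left for GPU compute after one round trip: {remaining_us:.1f} us")
```

With a single round trip, the communication overhead is about 1% of the 1000-microsecond budget, which is the point being made: the cost only bites if the work is split into many small CPU-GPU exchanges.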
Depends on the application. Your brain can notice the delay when playing something like digital drums after ~10-15 ms. Something with less 'attack' like guitar has a bit more wiggle room, and ambient synths even more so.
edit: also, vocals are most latency-sensitive if you are the one singing and hearing it played back at the same time.
> They're specifically built GFlop and memory-bandwidth machines
That would indicate their theoretical peak performance is higher. Unless you can line up your data so that the GPU will be processing it all the time in all its compute units without any memory latency, you won't get the theoretical peak. In those cases, it's perfectly possible a beefy CPU will be able to out-supercompute a GPU-based machine. It's just that some problems are more amenable to some architectures.
For games and graphics? I don't think so, a GPU has a ton of dedicated hardware that is very costly to simulate in software: triangle setup, rasterizers, tessellation units, texture mapping units, ROPs... and now even raytracing units.
In a sense yes, and this has already happened. It's just that GPUs are under lock and key just like early processors were. There are interesting development leaks you can find, like NVIDIA cards supporting USB connectors on some models, implying you can use just the GPU as a full computer.