| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by chrsig 797 days ago
	I'm by no means an expert in the topic, but to share my take anyway: It seems to me like there's just diminishing returns in SIMD approaches. If you're going to organize your data well for SIMD use then it's not a far reach to make it work well on a gpu, which will keep getting more cores. I imagine we'll get to a point where CPUs are actually just pretty dumb drivers for issuing gpu commands.

2 comments

gdiamos 797 days ago

As someone who worked on CUDA 15 years ago - it’s amazing to me that someone on the internet posted this statement.

Did GPUs win?

link

chrsig 797 days ago

I don't think that there's a "win" here. It's just sort of which way you tilt your head, how much space do you have to cram a ton of cores connected to a really wide memory bus and how close can you get the storage while keeping everything from catching on fire, no? ("just sort of" is going to have to skip leg day because of the herculean lift it just did)

It's a fairly fractal pattern in distributing computing. Move the high throughput heavy computation bits away from the low latency responsive bits ("low latency" here is relative to the total computation). Use an event loop for the reactive bits. Eventually someone will invert the event loop to use coroutines so everything looks synchronous (Go, anyone? python's gevent?).

After it seems to me that the only real question is if takes too long or costs too much to move the data to the storage location the heavy computation hardware uses. There's really not much of a conceptual difference between airflow driving snowflake and c++ running on a cpu driving cuda kernels. It takes a certain scale to make going from a OLTP database to an OLAP database worth it, just like it takes a certain scale to make a GPU worth it over simd instructions on the local processor.

link

thrtythreeforty 797 days ago

Yes and no. The compute density and memory bandwidth is unmatched. But the programming model is markedly worse, even for something like CUDA: you inherently have to think about parallelism, how to organize data, write your kernels in a special language, deal with wacky toolchains, and still get to deal with the CPU and operating system.

There is great power in the convenience of "with open('foo') as f:". Most workloads are still stitching together I/O bound APIs, not doing memory-bound or CPU-bound compute.

link

gdiamos 797 days ago

CUDA was always harder to program - even if you could get better perf

It took a long time to find something that really took advantage of it, but we did eventually. CUDA enabled deep learning which enabled LLMs . That's history.

What surprised me about the statement was that it implied that the model of python driving optimized GPU kernels was broader than deep learning.

That was the original vision of CUDA - most of the computational work being done by massively parallel cores

link

CyberDildonics 797 days ago

Win what? This person said they were inexperienced. SIMD is extremely valuable and the situations where it works well are not rare at all.

link

VHRanger 797 days ago

Not really, no.

GPUs are still very limited, even compared to the SIMD instruction set. You couldn't make a CUDAjson the same way the SIMDjson library is built for example, because it doesnt handle SIMD branching in a way that accomodates it.

Second, again, the latency issue. GPUs are only good if you have a pipeline of data to constantly feed it, so that the PCIe transfer latency issue is minimal.

link

touisteur 796 days ago

With PCIe 4 and 5 the latency issues are not as much a problem as they were, what with latency masking, gpudirect/storage-direct, busy-loop kernels (and hopefully soon scheduling libraries to make them easier to use) :-) and if you're really into real-time, computing time on NVIDIA GPUs has excellent jitter/stability and they are used in the very tight control loop of adaptive-optics (1ms-loop with mechanical actuators to drive).

The penalty for branching has reduced in the last years, but yeah it's still heavy, but if you're OK with a bit of wasted compute, you can do some 'speculative' execution and do both branches in different warps, use only one result...

But yes, you're still using an accelerator.

link