Hacker News
by raphlinus 1514 days ago
Good question! There are two separate issues with putting the GPU in the same package as the CPU. One is the memcpy bandwidth issue, which is indeed entirely mitigated (assuming the app is smart enough to exploit this). The other is round-trip time, which seems more related to context switches. I have an M1 Max here, and just found ~200µs for a very simple dispatch (just clearing 16k of memory).

I personally believe it may be possible to reduce latency using techniques similar to io_uring, but it may not be simple. Likely a major reason for the round trips is so that a trusted process (part of the GPU driver) can validate inputs from untrusted user code before they're presented to the GPU hardware.
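
For anyone unfamiliar with the io_uring idea, here is a toy sketch of its general shape: the untrusted side queues several commands in a ring, and the trusted side validates and handles the whole batch in one wake-up instead of one round trip per command. Everything here is hypothetical (`Ring`, `validate`, `drain`, the command fields, the size limit); it is not how any real GPU driver or io_uring itself is implemented, and a real version would use shared memory rather than a Python deque.

```python
from collections import deque


class Ring:
    """Fixed-capacity FIFO, a toy stand-in for a shared-memory ring buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def push(self, item):
        if len(self.items) >= self.capacity:
            return False  # ring is full; the caller has to wait or retry
        self.items.append(item)
        return True

    def pop(self):
        return self.items.popleft() if self.items else None


def validate(cmd):
    # Hypothetical check standing in for the driver validating untrusted input.
    return isinstance(cmd.get("size"), int) and 0 < cmd["size"] <= 1 << 20


def drain(submissions, completions):
    """Trusted side: validate and complete every queued command in one pass."""
    handled = 0
    while (cmd := submissions.pop()) is not None:
        status = "ok" if validate(cmd) else "rejected"
        completions.push({"id": cmd["id"], "status": status})
        handled += 1
    return handled


# Untrusted side queues three commands without waking the trusted side.
sq, cq = Ring(8), Ring(8)
for i, size in enumerate([16 * 1024, -1, 4096]):
    sq.push({"id": i, "size": size})

# A single wake-up of the trusted side handles the whole batch.
handled = drain(sq, cq)
results = [cq.pop() for _ in range(handled)]
```

The point of the design is that validation still happens on the trusted side, but its fixed cost is paid once per batch rather than once per dispatch.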

2 comments

Yes, I think you are right about driver overhead. There should be ways to amortize it, but that probably doesn't work very well for latency-sensitive problems! I expect that in most cases, if you have enough work to make AVX-512 worthwhile, you can afford the round trip.
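
A rough back-of-the-envelope version of that trade-off, as a sketch: the 200µs round trip is the figure quoted above, but the CPU and GPU throughput numbers are made-up placeholders, and the model (a fixed latency plus work divided by throughput) ignores everything else.

```python
def break_even_ops(round_trip_s, cpu_ops_per_s, gpu_ops_per_s):
    """Work size (in operations) at which GPU time equals CPU time.

    Model: cpu_time = W / cpu_rate, gpu_time = round_trip + W / gpu_rate.
    Returns None if the GPU is not faster per operation, since it then never wins.
    """
    if gpu_ops_per_s <= cpu_ops_per_s:
        return None
    return round_trip_s / (1.0 / cpu_ops_per_s - 1.0 / gpu_ops_per_s)


ROUND_TRIP_S = 200e-6   # ~200µs dispatch round trip, from the comment above
CPU_OPS_PER_S = 100e9   # hypothetical SIMD throughput, purely illustrative
GPU_OPS_PER_S = 1000e9  # hypothetical GPU throughput, purely illustrative

w = break_even_ops(ROUND_TRIP_S, CPU_OPS_PER_S, GPU_OPS_PER_S)
print(f"break-even at about {w:.3g} operations")
```

Under these assumed numbers, jobs smaller than the break-even size finish sooner on the CPU, which is the latency-sensitive case; larger jobs absorb the round trip.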
It's been a while, but IIRC the integrated GPUs are only L3-cache coherent. So while that greatly improves the memcpy problem, anything that would have fit in L1 and does a bunch of math may still be a better fit for AVX2 or AVX-512.