I'm not a gamedev, but I do a lot of numerical work. GPUs are great, but they're no replacement for SIMD. For example, I just put together a little benchmark on my desktop that sums 256 random Float32 numbers: doing it serially takes around 152 nanoseconds, whereas doing it with SIMD takes just 10 nanoseconds. Doing the exact same thing on my GPU took 20 microseconds, so roughly 2000x slower than SIMD:

julia> using CUDA, SIMD, BenchmarkTools
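julia> function vsum(::Type{Vec{N, T}}, v::Vector{T}) where {N, T}
           # Assumes length(v) is a multiple of N; otherwise the last load goes out of bounds.
           s = Vec{N, T}(0)        # N-lane accumulator, all zeros
           lane = VecRange{N}(0)   # v[lane + i] loads v[i:i+N-1] as one Vec
           for i ∈ 1:N:length(v)
               s += v[lane + i]
           end
           sum(s)                  # horizontal sum across the lanes
       end;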
julia> let L = 256
           print("Serial benchmark: "); @btime vsum(Vec{1, Float32}, v) setup=(v=rand(Float32, $L))
           print("SIMD benchmark: "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
           print("GPU benchmark: "); @btime sum(v) setup=(v=CUDA.rand($L))
       end;
Serial benchmark: 152.239 ns (0 allocations: 0 bytes)
SIMD benchmark: 10.359 ns (0 allocations: 0 bytes)
GPU benchmark: 19.917 μs (56 allocations: 1.47 KiB)
The reason for that is simply fixed overhead: it just takes that long to launch a kernel and get the result back from the GPU, no matter how little data there is. Almost none of that time was actually spent doing the computation. E.g. here's what that benchmark looks like if I instead have 256^2 numbers:

julia> let L = 256^2
           print("Serial benchmark: "); @btime vsum(Vec{1, Float32}, v) setup=(v=rand(Float32, $L))
           print("SIMD benchmark: "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
           print("GPU benchmark: "); @btime sum(v) setup=(v=CUDA.rand($L))
       end;
Serial benchmark: 42.370 μs (0 allocations: 0 bytes)
SIMD benchmark: 2.669 μs (0 allocations: 0 bytes)
GPU benchmark: 27.592 μs (112 allocations: 2.97 KiB)
So we're now at the point where the GPU is faster than serial, but still slower than SIMD. If we go up to 256^3 numbers, we finally see a convincing advantage for the GPU:

julia> let L = 256^3
           print("Serial benchmark: "); @btime vsum(Vec{1, Float32}, v) setup=(v=rand(Float32, $L))
           print("SIMD benchmark: "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
           print("GPU benchmark: "); @btime sum(v) setup=(v=CUDA.rand($L))
       end;
Serial benchmark: 11.024 ms (0 allocations: 0 bytes)
SIMD benchmark: 2.061 ms (0 allocations: 0 bytes)
GPU benchmark: 353.119 μs (113 allocations: 2.98 KiB)
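
Dividing those timings by the element count makes the fixed overhead easier to see. This is only a rough sketch: the numbers are copied from the three runs above and converted to nanoseconds, and `per_element` is a hypothetical throwaway helper, not part of any package.

```julia
# Timings copied from the three runs above, in nanoseconds.
sizes   = (256, 256^2, 256^3)
simd_ns = (10.359, 2.669e3, 2.061e6)
gpu_ns  = (19.917e3, 27.592e3, 353.119e3)

# Hypothetical helper: cost per element for each run.
per_element(times, sizes) = map(/, times, sizes)

for (n, s, g) in zip(sizes, per_element(simd_ns, sizes), per_element(gpu_ns, sizes))
    println("n = ", n, ": SIMD ", round(s, sigdigits=3),
            " ns/elem, GPU ", round(g, sigdigits=3), " ns/elem")
end
```

The GPU's per-element cost falls from about 78 ns at 256 elements to about 0.02 ns at 256^3, while SIMD sits around 0.04 ns per element for the two smaller sizes and rises to about 0.12 ns at 256^3.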
So the lesson here is that GPUs are only worth it if you actually have enough data to saturate the GPU; otherwise you're way better off using SIMD. GPUs are also just generally a lot more limiting than SIMD in many other ways.
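
Side note on `vsum`: as written it only works when `length(v)` is a multiple of `N`, which happens to hold for every size benchmarked here. A minimal sketch of a variant that sums the leftover elements with a scalar loop might look like this (`vsum_tail` is a hypothetical name, and this version has not been run through the benchmarks above):

```julia
using SIMD

# Hypothetical variant of vsum that tolerates lengths that are not a multiple of N.
function vsum_tail(::Type{Vec{N, T}}, v::Vector{T}) where {N, T}
    s = Vec{N, T}(0)
    lane = VecRange{N}(0)
    nfull = length(v) - length(v) % N   # largest multiple of N that fits in v
    for i ∈ 1:N:nfull
        s += v[lane + i]                # full N-wide loads only
    end
    total = sum(s)                      # horizontal sum across the lanes
    for i ∈ (nfull + 1):length(v)       # scalar cleanup for the leftover elements
        total += v[i]
    end
    total
end
```

Calling `vsum_tail(Vec{16, Float32}, rand(Float32, 1000))` then covers the first 992 elements with SIMD loads and the last 8 with the scalar loop.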
> GPUs are also just generally a lot more limiting than SIMD in many other ways.
What do you mean? (besides things like CUDA being available only on Nvidia/fragmentation issues.)