    julia> using CUDA, SIMD, BenchmarkTools

    julia> function vsum(::Type{Vec{N, T}}, v::Vector{T}) where {N, T}
               s = Vec{N, T}(0)
               lane = VecRange{N}(0)
               for i ∈ 1:N:length(v)
                   s += v[lane + i]
               end
               sum(s)
           end;

    julia> let L = 256
               print("Serial benchmark: "); @btime vsum(Vec{1, Float32}, v) setup=(v=rand(Float32, $L))
               print("SIMD benchmark: "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark: "); @btime sum(v) setup=(v=CUDA.rand($L))
           end;
    Serial benchmark: 152.239 ns (0 allocations: 0 bytes)
    SIMD benchmark: 10.359 ns (0 allocations: 0 bytes)
    GPU benchmark: 19.917 μs (56 allocations: 1.47 KiB)

    julia> let L = 256^2
               print("Serial benchmark: "); @btime vsum(Vec{1, Float32}, v) setup=(v=rand(Float32, $L))
               print("SIMD benchmark: "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark: "); @btime sum(v) setup=(v=CUDA.rand($L))
           end;
    Serial benchmark: 42.370 μs (0 allocations: 0 bytes)
    SIMD benchmark: 2.669 μs (0 allocations: 0 bytes)
    GPU benchmark: 27.592 μs (112 allocations: 2.97 KiB)

    julia> let L = 256^3
               print("Serial benchmark: "); @btime vsum(Vec{1, Float32}, v) setup=(v=rand(Float32, $L))
               print("SIMD benchmark: "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark: "); @btime sum(v) setup=(v=CUDA.rand($L))
           end;
    Serial benchmark: 11.024 ms (0 allocations: 0 bytes)
    SIMD benchmark: 2.061 ms (0 allocations: 0 bytes)
    GPU benchmark: 353.119 μs (113 allocations: 2.98 KiB)

Here are a few random limitations I can think of, other than those already mentioned:

* Float64 math is typically around 30x slower than Float32 math on "consumer-grade" GPUs, due to an arbitrary limitation meant to stop people from using consumer-grade chips for "workstation" purposes. This turns out not to be a big deal for things like machine learning, but lots of computational processes are actually rather sensitive to rounding errors and benefit a lot from 64-bit numbers, which are exactly what is very slow on these GPUs.

* Writing GPU-specific functions can be quite labour intensive compared to writing CPU code. Julia's CUDA.jl and KernelAbstractions.jl packages do make a lot of things quite a bit nicer than in most languages, but it's still a lot of work to write good GPU code.

* Profiling and understanding the performance of GPU programs is typically a lot more complicated than for CPU programs (even if there are some great tools for it!), because the performance model is just fundamentally more complex, with more stuff going on and more random pitfalls and gotchas.
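
To make the rounding point in the first bullet a bit more concrete, here is a small CPU-only sketch (no GPU needed). `repeated_add` and `relerr` are names I made up for illustration, not part of any package, and the loop is deliberately the most naive accumulation possible, which exaggerates the effect compared to pairwise or compensated summation.

```julia
# Add `x` to an accumulator `n` times, keeping everything in type T.
# Hypothetical helper for illustration only.
function repeated_add(x::T, n::Integer) where {T<:AbstractFloat}
    s = zero(T)
    for _ in 1:n
        s += x   # each addition rounds to the precision of T
    end
    return s
end

n = 10^7
exact = n / 10   # 0.1 * n, computed directly rather than by accumulation

s32 = repeated_add(0.1f0, n)   # Float32 accumulator
s64 = repeated_add(0.1, n)     # Float64 accumulator

# Relative error against the directly computed value (hypothetical helper).
relerr(s) = abs(Float64(s) - exact) / exact

println("Float32: ", s32, "  relative error ", relerr(s32))
println("Float64: ", s64, "  relative error ", relerr(s64))
```

When I reason through this, the Float32 total should drift visibly away from 10^6, while the Float64 total agrees to many significant digits. Real workloads are rarely this pathological, but long reductions and iterative solvers can accumulate error in a similar way, which is why being stuck with Float32 on a consumer GPU can hurt.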