Multi-GPU support is definitely on the TODO list. However, I am currently limited by the state of CUDA.jl here, as it does not have a device-aware memory pool yet.
I am also looking forward to CUDA.jl supporting f16 and int8 computations, which may enable another big speedup.