|
|
|
|
|
by marmaduke
2073 days ago
|
|
> From a Python programmer perspective, how does CUDA.jl compare to PyCUDA? I think the relevant comparison today is with Numba, here's a real world recurrence analysis, @cuda.jit
def _sseij(Y, I, J, O):
# strides
sty = cuda.blockDim.x
sbx = sty * cuda.blockDim.y
sby = sbx * cuda.gridDim.x
sbz = sby * cuda.gridDim.y
# this thread's index
t = (cuda.threadIdx.x
+ cuda.threadIdx.y * sty
+ cuda.blockIdx.x * sbx
+ cuda.blockIdx.y * sby
+ cuda.blockIdx.z * sbz)
if t < I.size:
i = I[t]
j = J[t]
x = nb.float32(0.0)
for k in range(Y.shape[0]):
x += (Y[k, i] - Y[k, j])**2
O[t] = math.sqrt(x)
most of it is index calculation, but super easy |
|
C isn't usually a breath of fresh air, but the last two times I tried to use nubacuda and failed (~1.5 year ago), it sure felt that way.
EDIT: yes, looks like it supports on-device buffers now. I'm still in "once bitten, twice shy" mode on account of the bugs and debug story, but I'm cautiously optimistic.