https://cuda.juliagpu.org/stable/tutorials/introduction/#Wri...
With KernelAbstractions.jl you can write a kernel once and target both CUDA and ROCm:
https://juliagpu.github.io/KernelAbstractions.jl/stable/kern...
For Python (or rather a Python-like language), there is also Triton (and probably others):
https://pytorch.org/blog/triton-kernel-compilation-stages/