While much has changed since then, the architecture is effectively the same. Julia's native CUDA support boils down to compiling through LLVM's PTX backend (NVPTX): Julia always generates LLVM IR, and the CUDA infrastructure "simply" retargets LLVM to emit PTX, generates the binary, and wraps that binary in a function that Julia calls. So it's really just a matter of the performance difference between the code generated by LLVM's PTX backend and the code generated by NVCC.
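To make that pipeline concrete, here is a minimal sketch, assuming the CUDA.jl package and an NVIDIA GPU are available. The kernel name `vadd!` and the array sizes are made up for illustration. The `@device_code_ptx` macro should print the PTX that LLVM emits for the kernel while it is being compiled and launched.

```julia
using CUDA

# Hypothetical element-wise addition kernel (illustrative only).
# GPU kernels must not return a value, hence the explicit `return nothing`.
function vadd!(c, a, b)
    # Global 1-based thread index.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

n = 1024
a = CUDA.rand(Float32, n)
b = CUDA.rand(Float32, n)
c = CUDA.zeros(Float32, n)

threads = 256
blocks = cld(n, threads)  # enough blocks to cover all n elements

# Compile vadd! (Julia -> LLVM IR -> PTX), launch it, and print the
# generated PTX along the way.
@device_code_ptx @cuda threads=threads blocks=blocks vadd!(c, a, b)

# Copy back to the host and check the result against a CPU computation.
@assert Array(c) == Array(a) .+ Array(b)
```

Swapping in `@device_code_llvm` shows the LLVM IR stage instead. Comparing the printed PTX with what NVCC produces for an equivalent CUDA C kernel is one way to see where any performance difference comes from.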