It was implemented clean-room, purely from the API surface and by trial and error with open CUDA code.