"Currently, I am working on [...] direct CUDA implementation, which will be significantly faster and probably come close to PyTorch."