Since it's 12x faster than real time on a 4090, I wonder how fast it would be on a small form factor device (an SBC). I take it this is using CUDA, so I really wonder how it would perform on my NVIDIA Xavier NX (and the more common Nanos out there)...!
I think it should work pretty well with Apple's MLX framework too, if anyone is willing to convert it. :)