> But it can't in general build torch-tensorrt or flash-attn because it has no way of knowing if Mercury was in retrograde when you ran pip.

This is a self-inflicted wound, since flash attention insists on building a native C++ extension, which is completely unnecessary in this case. What you can do is the following:

1) Compile your CUDA kernels offline.
2) Include those compiled kernels in a package you push to PyPI.
3) Call into the kernels with pure Python, without going through a C++ extension.

I do this for the CUDA kernels I maintain and it works great.

Flash attention currently publishes 48 (!) different packages[1], for different combinations of pytorch and C++ ABI. With this approach it would only have to publish one, and it would work for every combination of Python and pytorch.

[1] - https://github.com/Dao-AILab/flash-attention/releases/tag/v2...