by smhx
3380 days ago
Currently, few frameworks actually fuse kernels (fusion avoids launching many separate GPU kernels). If you look underneath a theano.scan or tf.scan, GPU kernels are still being launched individually (though they are likely stream-overlapped where appropriate). With TF's XLA compiler, they are slowly getting towards kernel fusion, which will reduce launch overhead. We have similar things in the works for PyTorch: quickly JIT-compiling, at runtime, the dynamic graph that is being executed. More news on this will come when the time is right.
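To make the launch-overhead point concrete, here is a toy sketch in plain Python. It is not how Theano, TF/XLA, or PyTorch implement anything: the `launch` helper and the `launches` counter are hypothetical stand-ins, where each whole-array call models one GPU kernel launch. It only illustrates that fusing `relu(a * b + c)` into one elementwise pass gives the same result with one launch instead of three.

```python
# Toy model of kernel fusion. Each call to launch() stands in for one GPU
# kernel launch; no GPU or framework is involved (assumption for illustration).

launches = 0  # counts simulated kernel launches


def launch(kernel, *arrays):
    """Apply an elementwise 'kernel' across the inputs and count one launch."""
    global launches
    launches += 1
    return [kernel(*xs) for xs in zip(*arrays)]


def unfused(a, b, c):
    # Three separate launches, with two intermediate arrays in between.
    t1 = launch(lambda x, y: x * y, a, b)
    t2 = launch(lambda x, z: x + z, t1, c)
    return launch(lambda x: max(x, 0.0), t2)


def fused(a, b, c):
    # One launch computing relu(a * b + c) per element, no intermediates.
    return launch(lambda x, y, z: max(x * y + z, 0.0), a, b, c)


a, b, c = [1.0, -2.0, 3.0], [2.0, 2.0, 2.0], [0.5, 0.5, 0.5]

launches = 0
out_unfused = unfused(a, b, c)
unfused_launches = launches

launches = 0
out_fused = fused(a, b, c)
fused_launches = launches

print(out_unfused, unfused_launches)
print(out_fused, fused_launches)
```

On a real GPU the saving comes from fewer launches and from not writing the intermediates back to memory; the counter here only models the first of those.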
Also, have you looked at Numba for the JIT compilation? It is probably best not to have yet another separately maintained Python JIT.