| I was not aware that the PyMC developers have forked and continued Theano: https://github.com/pymc-devs/Theano-PyMC It seems very active right now. Here is some further information: https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-i... I haven't really found references to its new name "Aesara". Apparently, the main new feature for Theano will be the JAX backend.
I do wonder about a few things, though, based on my experience working with Theano, including deep in the internals (trying to get further graph optimizations on theano.scan):
- Some parts of the code are not really clean.
- The code is extremely complex and hard to follow. See this: https://github.com/pymc-devs/Theano-PyMC/blob/master/theano/...
- This also made it very complicated to perform optimizations on the graph. See this: https://github.com/pymc-devs/Theano-PyMC/blob/master/theano/...
- In this specific case, it's also a problem of the API: theano.scan returns the whole sequence. If you only need the last entry, i.e. y[-1], there is a very complicated optimization rule which checks for that. Many optimizations around theano.scan are very complicated because of this.
- Here is one attempt at an optimization on theano.scan: https://github.com/Theano/Theano/pull/3640
- The graph building and especially the graph optimizations are very slow, because all the logic is done in pure Python. With big graphs, even just building the graph can take time, and the optimization passes take much longer. This was one of the most annoying problems when working with Theano: the startup time to build the graph could easily be several minutes. I also doubt that you can optimize this very much in pure Python; I think you would need to reimplement it in C++ or similar. When switching to TensorFlow, building the graph felt almost instant in comparison. I wonder if they have any plans for this in the fork.
- On the other hand, the optimizations on the graph are quite nice. You don't really have to care too much when writing code like log(softmax(z)); it will be rewritten to be numerically stable.
- The optimizations even went so far as to check whether an op can work in place on its input. That made writing ops more complicated: if you wanted good performance, you would write two versions, one which works in place on the tensor and one which does not, and then two further versions if you wanted CUDA as well. |
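To illustrate the log(softmax(z)) point, here is a minimal pure-Python sketch of why the rewrite matters. It is not Theano's actual optimization code; the function names are made up for illustration, and the stable version uses the standard log-sum-exp shift.

```python
import math

def log_softmax_naive(z):
    # Literal translation of log(softmax(z)): exponentiate, normalize, take the log.
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def log_softmax_stable(z):
    # Algebraically the same value, z_i - logsumexp(z), but the maximum is
    # subtracted first so that exp() never sees a large positive argument.
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

if __name__ == "__main__":
    z = [1000.0, 0.0]
    try:
        log_softmax_naive(z)
    except OverflowError as err:
        print("naive version failed:", err)
    print("stable version:", log_softmax_stable(z))
```

For moderate inputs both versions agree up to rounding; for large logits the naive one fails because math.exp overflows, while the shifted one does not.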
In 1D convolutions, the in-place version would need to use O(filter size) scratch space for lookahead, but this doesn't seem like it would be too significant. However, it might start to become significant in higher-dimensional convolutions.
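To make the 1D case concrete, here is a minimal pure-Python sketch. The assumptions are mine and not taken from any Theano op: an odd-length filter, zero padding, "same"-size output, and cross-correlation (no kernel flip). The function name is hypothetical. This variant buffers the original values that have already been overwritten and reads the upcoming ones directly from the array; buffering pending outputs instead would need the same O(filter size) scratch.

```python
from collections import deque

def correlate_same_inplace(x, w):
    """Overwrite list x with its 'same'-size cross-correlation with w."""
    k = len(w)
    if k % 2 != 1:
        raise ValueError("this sketch assumes an odd filter length")
    r = k // 2
    n = len(x)
    # Scratch space of size r: original values of x[i-r .. i-1],
    # which have already been overwritten by outputs.
    history = deque([0.0] * r, maxlen=r)
    for i in range(n):
        acc = 0.0
        for j in range(k):
            src = i + j - r
            if src < 0 or src >= n:
                continue  # zero padding outside the signal
            if src < i:
                val = history[src - i + r]  # overwritten: read from scratch
            else:
                val = x[src]  # not yet overwritten: read in place
            acc += w[j] * val
        history.append(x[i])  # save the original before overwriting it
        x[i] = acc
    return x
```

In 2D, the same idea would need to keep about r full rows of originals, so the scratch grows with the row width rather than just the filter size, which matches the concern about higher-dimensional convolutions.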
Any particular example that occurs to you?