For someone who does not know Fortran, would you agree that the conclusion that can be drawn here is that PyTorch is good enough?
> As you can see, fastGPT is slightly faster than PyTorch when doing as fair comparison as we can (both using OpenBLAS as a backend and both using caching, the default in PyTorch). You can also see that fastGPT loads the model very quickly and runs immediately, while both PyTorch and picoGPT take a long time to both load the model and to import all the Python libraries.
On one core, your benchmarks show the Fortran implementation is about 4% faster than the PyTorch one; on 4 cores, about 13% faster.
Excluding the IO, the main advantage seems to come not from switching the language, but from using Apple's Accelerate framework. It appears to me that this framework is now available for TensorFlow [1] and PyTorch [2] too.
Do you expect that once you port the code to GPU you will see a significant improvement over the GPU version of PyTorch?
Excellent questions. One advantage is import time and model loading time, where PyTorch is very slow, and it gets much worse for the larger models: for the 1558M model, PyTorch takes 24s to start while fastGPT takes 1s, about a 24x speedup.
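To make the startup comparison concrete, here is a minimal Python sketch of how one might time the import step on its own. The helper names `time_import` and `speedup` are hypothetical, and `json` is only a stand-in module; substitute `torch` to measure PyTorch on your own machine. A module that is already in `sys.modules` returns almost instantly, so the number only means something in a fresh interpreter.

```python
import importlib
import sys
import time


def time_import(module_name):
    """Return seconds spent importing module_name (hypothetical helper)."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


def speedup(slow_seconds, fast_seconds):
    """Ratio of two wall-clock times, e.g. 24 s vs 1 s gives 24.0."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    name = "json"  # stand-in; use "torch" to measure PyTorch startup
    already_loaded = name in sys.modules
    elapsed = time_import(name)
    print(f"import {name}: {elapsed:.4f} s (already loaded: {already_loaded})")
    print(f"startup speedup from the figures above: {speedup(24.0, 1.0):.0f}x")
```

Model loading would need to be timed separately in the same way, since it depends on the checkpoint format and the disk rather than on the import machinery.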
I am still studying the performance of the inference itself; it's really hard to do meaningful benchmarks that I can trust. The ones in my blog posts should be solid, since I eventually managed to control all the variables. For example, my faster tanh() implementation initially showed around a 20% speedup, but after I controlled everything, I only see a 4% speedup without caching, and less than that with caching.
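As an illustration of what controlling the variables can look like in practice, here is a small timing harness sketch in Python. It is not the benchmark setup used for fastGPT; the names `bench` and `relative_speedup` and the two toy workloads are hypothetical, and "X% faster" is assumed here to mean the ratio of run times minus one. The idea is to discard warmup runs, repeat the measurement, and compare minimum or median times rather than a single run.

```python
import statistics
import time


def bench(fn, repeats=5, warmup=1):
    """Time fn() several times; return (min, median) in seconds."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization; not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return min(samples), statistics.median(samples)


def relative_speedup(baseline_seconds, candidate_seconds):
    """Fractional speedup of candidate over baseline, e.g. 0.04 for 4%."""
    return baseline_seconds / candidate_seconds - 1.0


if __name__ == "__main__":
    # Hypothetical workloads standing in for two implementations.
    def baseline():
        return sum(i * i for i in range(200_000))

    def candidate():
        return sum(i * i for i in range(190_000))

    base_min, base_med = bench(baseline)
    cand_min, cand_med = bench(candidate)
    print(f"baseline  min={base_min:.4f}s median={base_med:.4f}s")
    print(f"candidate min={cand_min:.4f}s median={cand_med:.4f}s")
    print(f"speedup (min times): {relative_speedup(base_min, cand_min):+.1%}")
```

Differences of a few percent are easily swamped by run-to-run noise, which is one reason an early 20% figure can shrink once the measurement is repeated under the same conditions.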
I think the main advantage of Fortran is that all I did was a rewrite (two afternoons), and I immediately saw performance better than PyTorch, which is highly optimized production code developed by thousands of professionals. After controlling for everything and doing a fair comparison, it's only slightly faster (at the moment!), but I think that's still quite an impressive result. And using Accelerate, it's a lot faster. I am guessing this problem is limited by matrix-matrix multiplication, in which case even Python is fast on a single core (even the pure Python/NumPy picoGPT is competitive after my PR), which the results seem to show.
Thanks for the links, I'll try PyTorch with Accelerate and report back.
I don't know regarding GPU, we'll have to see.
But in general, right now the code is not parallel: it runs on a single core, and the only parallelism comes from OpenBLAS. It's a great foundation to now parallelize it and see how it performs. In other words, with Fortran you start "fast" right away, and then you can try speeding it up from there, while in Python it is quite a lot of work to even get to this performance.
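Since the only parallelism comes from OpenBLAS, a single-core versus multi-core comparison comes down to how many threads the BLAS library is allowed to use. A minimal sketch, assuming an OpenBLAS-backed program: OpenBLAS reads the `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` environment variables when it loads, so they can be pinned per run by launching the benchmark as a subprocess. The helper names `blas_env` and `run_pinned` are hypothetical, and the command shown only echoes the setting.

```python
import os
import subprocess
import sys


def blas_env(num_threads):
    """Copy the current environment with BLAS/OpenMP thread counts pinned."""
    env = dict(os.environ)
    env["OPENBLAS_NUM_THREADS"] = str(num_threads)
    env["OMP_NUM_THREADS"] = str(num_threads)
    return env


def run_pinned(command, num_threads):
    """Run command (a list of strings) with pinned thread counts; return stdout."""
    result = subprocess.run(
        command,
        env=blas_env(num_threads),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in command that just echoes the setting; replace it with the
    # benchmark you actually want to run.
    echo = [sys.executable, "-c",
            "import os; print(os.environ['OPENBLAS_NUM_THREADS'])"]
    for n in (1, 4):
        print(n, "thread(s) ->", run_pinned(echo, n).strip())
```

Setting the variables in the child's environment, rather than inside an already running process, avoids the case where the library has read its thread count before the change takes effect.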
I don't have much GPU experience myself. As the sibling comment said, there are Fortran compilers that can offload to GPU; there is also CUDA Fortran, and there is OpenMP offloading. I think LLVM can also target it somehow, and I would like to support it in LFortran, a compiler that we are developing. In general, I am hoping people more experienced with GPUs would be interested in helping out.
It's here: https://github.com/certik/theoretical-physics/. I was hoping more people would contribute to the effort, but so far I haven't managed to spark enough interest. It's open source, it's out there, and if I find even a single person willing to contribute to get it polished, the development will pick up.
In the absence of that, I would have to drive it hard as my main effort, but right now my main effort is LFortran/LPython. We are making excellent progress there, so I am not spreading myself too thin until the compilers are delivered.
If anyone is interested in the theoretical physics book, please let me know!
They use OpenBLAS. PyTorch is slower presumably due to its poor threading. Nothing special about Fortran here; any language without a global lock would have the same result.