by credit_guy
1193 days ago
Great job. For someone who does not know Fortran, would you agree that the conclusion that can be drawn here is that PyTorch is good enough?
> As you can see, fastGPT is slightly faster than PyTorch when doing as fair comparison as we can (both using OpenBLAS as a backend and both using caching, the default in PyTorch). You can also see that fastGPT loads the model very quickly and runs immediately, while both PyTorch and picoGPT take a long time to both load the model and to import all the Python libraries.
On one core, from your benchmarks, I see that the Fortran implementation is about 4% faster than the PyTorch one. On 4 cores it is about 13% faster. Excluding the IO, the main advantage seems to come not from switching the language, but from using Apple's Accelerate framework. It appears to me that this framework is now available for TensorFlow [1] and PyTorch [2] too. Do you expect that once you port the code to GPU you will see a significant improvement over the GPU version of PyTorch?
[1] https://blog.tensorflow.org/2020/11/accelerating-tensorflow-...
[2] https://towardsdatascience.com/installing-pytorch-on-apple-m...
I am still studying the performance of the inference itself; it's really hard to do meaningful benchmarks that I can trust. The ones in my blog posts should be solid, since I eventually managed to control all variables. For example, my faster tanh() implementation initially showed around a 20% speedup, but after I controlled everything, I only see a 4% speedup without caching, and less than that with caching.
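To illustrate what "controlling everything" can look like in practice, here is a minimal Python sketch of a timing harness, not the setup used for the blog post numbers. It warms up before measuring, takes the median of several repeats, and reports the speedup as a percentage. The function names and the alternative tanh formula are hypothetical stand-ins chosen only for illustration; they are not the Fortran tanh() mentioned above.

```python
import math
import statistics
import time


def tanh_reference(x: float) -> float:
    """Baseline: the standard library tanh."""
    return math.tanh(x)


def tanh_alternative(x: float) -> float:
    """Hypothetical stand-in for a 'candidate' implementation.

    Uses the identity tanh(x) = 1 - 2 / (exp(2x) + 1).
    Only intended for moderate |x|; exp() overflows for very large x.
    """
    return 1.0 - 2.0 / (math.exp(2.0 * x) + 1.0)


def median_time(fn, inputs, repeats: int = 7, warmup: int = 2) -> float:
    """Return the median wall-clock time (seconds) of calling fn on all inputs."""
    # Warm-up runs are discarded so one-time costs do not skew the samples.
    for _ in range(warmup):
        for x in inputs:
            fn(x)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            fn(x)
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to occasional slow runs than the mean.
    return statistics.median(samples)


def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """How much faster the candidate is than the baseline, in percent."""
    return (baseline_s / candidate_s - 1.0) * 100.0


if __name__ == "__main__":
    inputs = [i / 1000.0 - 5.0 for i in range(10_000)]  # values in [-5, 5)
    t_ref = median_time(tanh_reference, inputs)
    t_alt = median_time(tanh_alternative, inputs)
    print(f"reference:   {t_ref:.6f} s")
    print(f"alternative: {t_alt:.6f} s")
    print(f"speedup:     {speedup_percent(t_ref, t_alt):+.1f}%")
```

Under this definition, a baseline of 1.04 s against a candidate of 1.00 s counts as a 4% speedup. Which variant wins here will depend on the machine and the Python version, so the printed numbers should not be read as a result.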
I think the main advantage of Fortran is that all I did was a rewrite (two afternoons) and I right away saw performance better than PyTorch, which is highly optimized production code, developed by thousands of professionals. After controlling for everything and doing a fair comparison, it's only slightly faster (at the moment!), but that's still quite an impressive result, I think. And using Accelerate, it's a lot faster. I am guessing this problem is limited by matrix-matrix multiplication, in which case even Python is fast on a single core (even the pure Python/NumPy picoGPT is competitive after my PR), which is what the results seem to show.
Thanks for the links, I'll try PyTorch with Accelerate and report back.
I don't know regarding GPU, we'll have to see.
But in general, right now the code is not parallel: it runs on a single core, and the only parallelism comes from OpenBLAS. It's a great foundation to now parallelize it and see how it performs. In other words, with Fortran you start "fast" right away, and then you can try speeding it up from there, while in Python it is quite a lot of work to even get to this performance.
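Since the only parallelism comes from OpenBLAS, the one-core vs. 4-core comparison comes down to how many threads OpenBLAS is allowed to use. Here is a minimal Python sketch of pinning that from a benchmark driver, assuming the binary under test links OpenBLAS and honors its `OPENBLAS_NUM_THREADS` environment variable. The helper names are hypothetical, and the command you pass in is whatever benchmark you want to run.

```python
import os
import subprocess


def env_with_blas_threads(num_threads: int) -> dict:
    """Copy the current environment and pin the OpenBLAS thread count."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    env = dict(os.environ)
    # OpenBLAS reads this variable at startup to size its thread pool.
    env["OPENBLAS_NUM_THREADS"] = str(num_threads)
    return env


def run_with_threads(command: list, num_threads: int) -> subprocess.CompletedProcess:
    """Run a benchmark command (supplied by the caller) with a fixed thread count."""
    return subprocess.run(
        command,
        env=env_with_blas_threads(num_threads),
        check=True,
        capture_output=True,
        text=True,
    )
```

Running the same command once with `num_threads=1` and once with `num_threads=4` keeps everything else in the environment identical between the two runs. Other BLAS backends use different variables, so this only applies to the OpenBLAS builds.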