FastGPT: Faster than PyTorch in 300 lines of Fortran

	FastGPT: Faster than PyTorch in 300 lines of Fortran (ondrejcertik.com)
	51 points by chl 1193 days ago

4 comments

frankreyes 1191 days ago

Fortran can produce faster code than C or C++ because it does not allow a function's arguments to alias one another, which lets the compiler optimize more aggressively. https://stackoverflow.com/questions/146159/is-fortran-easier...
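
To make the aliasing point concrete, here is a minimal Python sketch. It is not Fortran or C: one-element lists stand in for pointers, and the function names are hypothetical. It shows that reading an argument once instead of twice (the kind of rewrite an optimizer would like to make) only preserves the result when the two arguments do not refer to the same storage.

```python
def add_twice(a, b):
    """Reference semantics: re-read b[0] before each update of a[0]."""
    a[0] += b[0]
    a[0] += b[0]


def add_twice_hoisted(a, b):
    """'Optimized' variant: read b[0] once. Only valid if a and b do not alias."""
    b0 = b[0]
    a[0] += 2 * b0


def run(fn, aliased):
    """Call fn with either two distinct cells or the same cell passed twice."""
    x = [1]
    y = x if aliased else [1]
    fn(x, y)
    return x[0]


distinct_ref = run(add_twice, aliased=False)
distinct_opt = run(add_twice_hoisted, aliased=False)
aliased_ref = run(add_twice, aliased=True)
aliased_opt = run(add_twice_hoisted, aliased=True)

print("distinct arguments:", distinct_ref, distinct_opt)
print("aliased arguments: ", aliased_ref, aliased_opt)
```

With distinct arguments both versions agree; with aliased arguments they do not. A C compiler has to assume the aliased case is possible unless told otherwise (for example with `restrict`), while Fortran's rules for dummy arguments let the compiler assume the non-aliased case.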

certik 1193 days ago

The author here. I am happy to answer any questions.

credit_guy 1193 days ago

Great job.

For someone who does not know Fortran, would you agree that the conclusion that can be drawn here is that PyTorch is good enough?

  > As you can see, fastGPT is slightly faster than PyTorch when doing as fair comparison as we can (both using OpenBLAS as a backend and both using caching, the default in PyTorch). You can also see that fastGPT loads the model very quickly and runs immediately, while both PyTorch and picoGPT take a long time to both load the model and to import all the Python libraries.

On one core, your benchmarks show the Fortran implementation is about 4% faster than the PyTorch one. On 4 cores, it is about 13% faster.

Excluding the IO, the main advantage seems to come not from switching the language, but from using Apple's Accelerate framework. It appears to me that this framework is now available for TensorFlow [1] and PyTorch [2] too.

Do you expect that once you port the code to GPU you will see a significant improvement over the GPU version of PyTorch?

[1] https://blog.tensorflow.org/2020/11/accelerating-tensorflow-...

[2] https://towardsdatascience.com/installing-pytorch-on-apple-m...

certik 1192 days ago

Excellent questions. One advantage is import time and model loading time, where PyTorch is very slow, and it gets much worse for the larger models: for the 1558M model PyTorch takes 24s to start, while fastGPT takes 1s, about a 24x speedup.

I am still studying the performance of the inference itself; it's really hard to do meaningful benchmarks that I can trust. The ones in my blog posts should be solid, since I eventually managed to control all variables. For example, my faster tanh() implementation initially showed around 20% speedup, but after I controlled everything, I only see 4% speedup without caching, and less than that with caching.
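
As a rough illustration of the kind of controlled micro-benchmark described above, here is a minimal Python sketch. It assumes the tanh() in question is the one inside the GELU activation (the GPT-2 reference code uses the tanh approximation of GELU); it is not fastGPT's code, and `best_time` is a hypothetical helper. The idea is to warm up once, repeat the identical workload several times, and keep the fastest run so that background noise does not inflate or hide a small speedup.

```python
import math
import timeit

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)


def gelu(x):
    """tanh approximation of GELU, the form used in the GPT-2 reference code."""
    return 0.5 * x * (1.0 + math.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x**3)))


def best_time(fn, data, repeats=5, number=3):
    """Time fn over data; return the fastest per-pass time in seconds.

    Taking the minimum over several repeats reduces the influence of
    other processes, which matters when the expected effect is a few percent.
    """
    def workload():
        for v in data:
            fn(v)

    workload()  # warm-up pass, not timed
    times = timeit.repeat(workload, repeat=repeats, number=number)
    return min(times) / number


# Fixed input so every candidate implementation sees the same data.
data = [i / 1000.0 - 5.0 for i in range(10_000)]
t = best_time(gelu, data)
print(f"best time per pass: {t:.6f} s")
```

To compare an alternative tanh, one would pass a second function through `best_time` with the same `data` and the same settings, and only then compute the ratio.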

I think the main advantage of Fortran is that all I did was a rewrite (two afternoons) and I right away saw performance better than PyTorch, which is highly optimized production code, developed by thousands of professionals. After controlling for everything and doing a fair comparison, it's only slightly faster (at the moment!), but that's still quite an impressive result I think. And using Accelerate, it's a lot faster. I am guessing this problem is limited by matrix-matrix multiplication, in which case even Python is fast on a single core (even the pure Python/NumPy picoGPT is competitive after my PR), which the results seem to show.

Thanks for the links, I'll try PyTorch with Accelerate and report back.

I don't know regarding GPU, we'll have to see.

But in general, right now the code is not parallel: it runs on a single core, and the only parallelism comes from OpenBLAS. It's a great foundation to now parallelize it and see how it performs. In other words, with Fortran you start "fast" right away, and then you can try speeding it up from there, while in Python it is quite a lot of work to even get it to this performance.

jiehong 1193 days ago

The article mentions requiring help to use the GPU for compute: how easy is using the GPU for compute in Fortran? (Total newbie in Fortran)

Koshkin 1193 days ago

Just to chime in here: Intel's Fortran compiler (part of their oneAPI ecosystem) "provides CPU and GPU offload support."

https://www.intel.com/content/www/us/en/developer/tools/onea...

certik 1192 days ago

I don't have much GPU experience myself. As the sibling comment said, there are Fortran compilers that can offload to GPU. There is also CUDA Fortran, and there is OpenMP offloading. I think LLVM can also target it somehow, and I would like to support it in LFortran, a compiler that we are developing. In general I am hoping people more experienced with GPU would be interested in helping out.

arbitrandomuser 1193 days ago

What happened to the theoretical physics reference book?

certik 1192 days ago

It's here: https://github.com/certik/theoretical-physics/. I was hoping more people would contribute to the effort, but so far I haven't managed to spark enough interest. It's open source, it's out there, and if I find at least a single person willing to contribute to get it polished, the development will pick up.

In the absence of that, I would have to drive it hard as my main effort, but right now my main effort is LFortran/LPython. We are making excellent progress there, so I am not spreading myself too thin until the compilers are delivered.

If anyone is interested in the theoretical physics book, please let me know!

lostmsu 1193 days ago

They use OpenBLAS. PyTorch is slower presumably due to its poor threading. Nothing special about Fortran here, any language without a global lock would have the same result.

jiehong 1193 days ago

Very nice to see Fortran in that space!

That page doesn’t display very well on mobile though (header seems correct, but text is too wide and you need to zoom out. Safari mobile)

certik 1192 days ago

Sorry about that. I was using some Hugo theme and I haven't checked on mobile. I should redo my webpage to be mobile friendly.
