|
Excellent questions. One is import time and model loading time, where PyTorch is very slow, and it gets much worse for the larger models: for the 1558M model PyTorch takes 24s to start, while fastGPT takes 1s, about a 24x speedup. I am still studying the performance of the inference itself; it's really hard to do meaningful benchmarks that I can trust. The ones in my blog posts should be solid, since I eventually managed to control all variables. For example, my faster tanh() implementation initially showed around a 20% speedup, but after I controlled for everything, I only see a 4% speedup without caching, and less than that with caching. I think the main advantage of Fortran is that all I did was a rewrite (two afternoons) and I right away saw performance better than PyTorch, which is highly optimized production code developed by thousands of professionals. After controlling for everything and doing a fair comparison, it's only slightly faster (at the moment!), but I think that's still quite an impressive result. And using Accelerate, it's a lot faster. I am guessing this problem is limited by matrix-matrix multiplication, in which case even Python is fast on a single core (even the pure Python/NumPy picoGPT is competitive after my PR), which the results seem to show. Thanks for the links, I'll try PyTorch with Accelerate and report back. I don't know about the GPU, we'll have to see. But in general, right now the code is not parallel: it runs on a single core, and the only parallelism comes from OpenBLAS. It's a great foundation to now parallelize it and see how it performs. In other words, with Fortran you start "fast" right away, and then you can try speeding it up from there, while in Python it is quite a lot of work to even get to this performance. |
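As a rough illustration of measuring the startup part in isolation, here is a minimal sketch. It is not the script behind the numbers above. It assumes that launching a fresh interpreter per run and keeping the fastest run is an acceptable way to time an import, and it only covers import time, not model loading. The helper name `startup_seconds` is hypothetical, and `json` is a stand-in module so the example runs without PyTorch installed.

```python
import subprocess
import sys
import time


def startup_seconds(module: str, repeats: int = 5) -> float:
    """Best-of-N wall time to start a fresh interpreter and import `module`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # A new process per run, so a previous import is never reused
        # from sys.modules inside the same interpreter.
        subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other activity on the machine.
    return min(timings)


if __name__ == "__main__":
    # "json" is a placeholder (assumption); substitute "torch" if it is installed.
    print(f"{startup_seconds('json'):.3f} s")
```

The operating system's file cache stays warm between runs, so this reports warm-start time. A cold start would need the cache dropped between runs, which this sketch does not attempt.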