I didn't write a pure-Python version; mine uses NumPy, and although I haven't benchmarked it, it runs the stories15M model much faster than 1.3 tok/sec on my 2018 MacBook. You should try swapping in NumPy matrix multiplication, or @ (I actually don't know if that's native or part of another package), for matmul and see what changes.
The llama2.py code defines its own accum, rmsnorm and matmul. Why not use NumPy? A "pure Python" implementation that is much slower than one using NumPy is less interesting to me.
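For anyone curious what that swap might look like, here is a rough sketch, not the actual llama2.py code. The function names (with `_pure` and `_np` suffixes), the argument order, and the `eps` default are all my own assumptions for illustration. On the `@` question: the operator itself is built-in Python syntax (since 3.5), but plain lists don't implement it, so in practice you need something like NumPy arrays on either side.

```python
import math
import numpy as np

# --- Pure-Python versions (hypothetical signatures, lists of floats) ---

def accum_pure(a, b):
    # Element-wise in-place add: a[i] += b[i]
    for i in range(len(a)):
        a[i] += b[i]

def rmsnorm_pure(x, weight, eps=1e-5):
    # Scale x by 1 / sqrt(mean(x^2) + eps), then by a learned weight.
    # eps=1e-5 is an assumed default, not taken from the source.
    ss = sum(v * v for v in x) / len(x) + eps
    scale = 1.0 / math.sqrt(ss)
    return [w * (scale * v) for w, v in zip(weight, x)]

def matmul_pure(w, x):
    # w is a list of rows (d x n), x has length n; returns a list of length d.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

# --- NumPy equivalents (operate on np.ndarray) ---

def accum_np(a, b):
    a += b  # in-place, vectorized

def rmsnorm_np(x, weight, eps=1e-5):
    return weight * (x / np.sqrt(np.mean(x * x) + eps))

def matmul_np(w, x):
    return w @ x  # same as np.matmul(w, x)
```

Both sets should give the same numbers up to floating-point rounding; the NumPy versions just push the inner loops down into compiled code, which is presumably where most of the speedup would come from.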
If your goal is to make it as fast as possible, then a Python implementation is certainly not the solution here. I think that is exactly why llama.cpp got so much attention.
I find these efforts impressive, but what is the value proposition here? (I'm not just talking about this fork, but about Karpathy's llama2.c as well.)
Personally, the value for me was in implementing complex logic from a scientific paper in pure Python.
It helps you understand the essence of a cutting-edge AI technology.
And it's quite fascinating that it takes only about 500 lines of core code to implement inference for such a complex model.
As for the original llama2.c, I believe the value proposition is having a simple implementation that can run inference locally on a wide variety of platforms. What if we could run a fine-tuned Llama7B on our phones?