The llama2.py code defines its own accum, rmsnorm and matmul. Why not use NumPy? A "pure Python" code that is much slower than one using NumPy is less interesting to me.
If your goal is to make it as fast as possible, then for sure Python implementation is not a solution here. I think for this exactly reason llama.cpp got high attention