Karpathy's llama2.c ported to pure Python

	Karpathy's llama2.c ported to pure Python (github.com)
	6 points by atairov 1045 days ago

3 comments

andy99 1045 days ago

I made a Jupyter notebook, "llama2.ipynb", from the Karpathy project: https://github.com/rbitr/llama2.ipynb

I didn't do a pure Python version; mine uses NumPy, and although I haven't benchmarked it, it runs the stories15M model much faster than 1.3 tok/sec on my 2018 MacBook. You should try swapping in NumPy matrix multiplication, or @ (I actually don't know if that's native or part of another package), for matmul and see what changes.
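
On the @ question: the operator itself is native Python syntax (added in Python 3.5 by PEP 465), but no built-in type implements it, so in practice it needs something like NumPy arrays. Below is a minimal sketch of the suggested swap, assuming the weights are stored as a flat row-major list the way llama2.c lays them out. The function names and the (w, x, n, d) signature are hypothetical and may not match the port's actual code. NumPy is imported optionally so the pure-Python half runs on its own.

```python
try:
    import numpy as np  # optional; only needed for matmul_numpy
except ImportError:
    np = None


def matmul_pure(w, x, n, d):
    """Compute W (d x n) @ x (n,), with W given as a flat row-major list."""
    out = [0.0] * d
    for i in range(d):
        row = i * n          # offset of row i inside the flat weight list
        acc = 0.0
        for j in range(n):
            acc += w[row + j] * x[j]
        out[i] = acc
    return out


def matmul_numpy(w, x, n, d):
    """Same product, delegated to NumPy via the @ operator."""
    W = np.asarray(w, dtype=np.float32).reshape(d, n)
    return (W @ np.asarray(x, dtype=np.float32)).tolist()


# Tiny 2 x 3 example: rows [1, 2, 3] and [4, 5, 6] times a vector of ones.
w = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 1.0, 1.0]
result = matmul_pure(w, x, n=3, d=2)
```

Timing both functions on the model's real matrix sizes would show how much of the gap comes from the inner loop alone.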

atairov 1045 days ago

1.3 tok/sec is similar to the performance of my Python port, though I ran it on an M1 Max.

Bostonian 1045 days ago

The llama2.py code defines its own accum, rmsnorm, and matmul. Why not use NumPy? "Pure Python" code that is much slower than a NumPy version is less interesting to me.
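
For reference, here is a sketch of what the two styles look like for rmsnorm, following the formula in llama2.c (mean of squares plus a 1e-5 epsilon, then scaling by the learned weight). The function names are hypothetical, not copied from llama2.py, and NumPy is treated as optional.

```python
import math

try:
    import numpy as np  # optional; only needed for rmsnorm_numpy
except ImportError:
    np = None


def rmsnorm_pure(x, weight, eps=1e-5):
    """RMS-normalize x, then scale element-wise by weight (plain lists)."""
    ss = sum(v * v for v in x) / len(x) + eps   # mean of squares + epsilon
    scale = 1.0 / math.sqrt(ss)
    return [w * (scale * v) for w, v in zip(weight, x)]


def rmsnorm_numpy(x, weight, eps=1e-5):
    """Same computation expressed with NumPy array operations."""
    x = np.asarray(x, dtype=np.float32)
    weight = np.asarray(weight, dtype=np.float32)
    return weight * x / np.sqrt(np.mean(x * x) + eps)


normed = rmsnorm_pure([3.0, 4.0], [1.0, 2.0])
```

The pure-Python version spells out every step, which is the educational appeal; the NumPy version is shorter and pushes the loops into compiled code.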

atairov 1045 days ago

If your goal is to make it as fast as possible, then a Python implementation is certainly not the solution here. I think that is exactly why llama.cpp got so much attention.

behnamoh 1045 days ago

I find these efforts impressive, but what is the value proposition here? (I'm not just talking about this fork, but also about Karpathy's llama2.c.)

atairov 1045 days ago

For me personally, the value was in implementing complex logic from a scientific paper in pure Python. It helps to understand the essence of a cutting-edge AI technology. And it's quite fascinating that it takes only about 500 lines of core code to implement inference for such a complex system.

atairov 1045 days ago

As for the original llama2.c, I believe the value proposition is a simple implementation that can run inference locally on a wide variety of platforms. What if we can execute fine-tuned Llama7B on our phones?

brucethemoose2 1045 days ago

> What if we can execute fine-tuned Llama7B on our phones?

7B and 13B are already quite performant with mlc-llm (which uses an Apache TVM Vulkan/Metal backend). Llama.cpp has the potential to perform well too.

These "single file" implementations are not meant to be optimized or feature-rich, I don't think.

brucethemoose2 1045 days ago

It's educational. It shows how Llama works in a clear, concise, testable way.

westurner 1045 days ago

Writing one's own and/or porting every line of code has great value.
