| HN Mirror



	by tarruda 1144 days ago
	3 tokens/sec is a lot faster than what I experienced. Even though your CPU has a lot more cores, I think llama.cpp was not able to make good use of more than 8 threads. When did you test this? Maybe llama.cpp has had some improvements since I used it (which was at the start of the project).

2 comments

Ambix 1143 days ago

It's not about the number of threads, it's about the memory bottleneck. The sweet spot for my M1 Pro laptop is around 6 threads and a 4-bit model. I've managed to get 20 tokens per sec, which is really impressive.
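
A rough back-of-the-envelope sketch of the memory-bottleneck argument, not something from the comment itself: if generating each token requires reading every weight from memory once, throughput is capped at roughly memory bandwidth divided by model size. The 200 GB/s bandwidth figure and the 7B parameter count below are assumptions for illustration (the comment does not say which model was used), and `max_tokens_per_sec` is a hypothetical helper.

```python
def max_tokens_per_sec(bandwidth_gb_s: float,
                       n_params_billion: float,
                       bits_per_weight: float) -> float:
    """Rough upper bound on generation speed when memory-bound.

    Assumes every weight is read from memory once per generated token
    and ignores caches, the KV cache, and compute time.
    """
    model_gb = n_params_billion * bits_per_weight / 8  # weight size in GB
    return bandwidth_gb_s / model_gb


# Assumed figures, not taken from the thread:
BANDWIDTH_GB_S = 200.0   # nominal M1 Pro memory bandwidth
N_PARAMS_BILLION = 7.0   # hypothetical 7B-parameter model

for bits in (16, 4):
    bound = max_tokens_per_sec(BANDWIDTH_GB_S, N_PARAMS_BILLION, bits)
    print(f"{bits}-bit weights: at most ~{bound:.1f} tokens/sec")
```

Under these assumptions, going from 16-bit to 4-bit weights raises the ceiling by 4x, while adding threads does nothing once the memory bus is saturated. Real throughput sits well below the bound.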


logicchains 1144 days ago

I tested this on the latest master. Llama.cpp has had some performance improvements, although I don't know if that'd be enough to make it 3x faster.
