


	by abetlen 1205 days ago
	You can see for yourself at https://github.com/abetlen/llama-cpp-python (assuming you have the model weights). I get ~140 ms per token running a 13B-parameter model on a ThinkPad laptop with a 14 core Intel i7-9750 processor. Because it's CPU inference, the initial prompt processing takes longer than on a GPU, so total latency is still higher than I'd like. I'm working on some caching solutions that should make this bearable for things like chat.
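
	To make the latency point concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not the llama-cpp-python API and not the caching approach mentioned above, just arithmetic. The function name, the 40 ms per prompt token figure, and the token counts are hypothetical values picked for illustration; only the ~140 ms per generated token comes from the comment.

```python
def estimate_latency_ms(prompt_tokens, new_tokens,
                        prompt_ms_per_token,
                        gen_ms_per_token=140.0,
                        cached_prefix_tokens=0):
    """Rough end-to-end latency for one completion, in milliseconds.

    Hypothetical helper: models latency as (prompt tokens still to be
    processed) * prompt cost + (generated tokens) * generation cost.
    cached_prefix_tokens is the part of the prompt whose processing
    could be skipped if a previous turn's state were reused.
    """
    if not 0 <= cached_prefix_tokens <= prompt_tokens:
        raise ValueError("cached prefix must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_prefix_tokens
    return uncached * prompt_ms_per_token + new_tokens * gen_ms_per_token


# Assumed chat turn: 500 tokens of history, 50-token reply,
# 40 ms per prompt token (made-up figure, not a measurement).
no_cache = estimate_latency_ms(500, 50, prompt_ms_per_token=40.0)
with_cache = estimate_latency_ms(500, 50, prompt_ms_per_token=40.0,
                                 cached_prefix_tokens=450)
print(f"no cache:   {no_cache / 1000:.1f} s")
print(f"with cache: {with_cache / 1000:.1f} s")
```

	Under these assumed numbers, most of the wait in a chat turn is re-processing history that was already seen, which is why reusing the shared prefix matters more than shaving the per-token generation time.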