by _w1tm 1031 days ago
Yes. I recently benchmarked the 70B Llama 2 model on a 24 vCPU vSphere host with 64 GB RAM (through Ollama), and it managed ~0.15 tokens/second. Useless for any interactive use case, but better than nothing. For comparison, the 7B Llama 2 model ran at ~1.5 tokens/second on the same hardware, while the cheapest M1 MacBook Air can do ~10 tokens/second thanks to GPU acceleration.
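
To put those rates in perspective, here is a rough sketch of the arithmetic. It assumes throughput is computed the way Ollama's generate response reports it (an `eval_count` of generated tokens and an `eval_duration` in nanoseconds). The 200-token reply length and the helper names are made up for illustration, not something from the benchmark itself.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from a token count and a duration in nanoseconds."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def seconds_for_reply(n_tokens: int, rate: float) -> float:
    """Wall-clock seconds needed to generate n_tokens at `rate` tokens/second."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return n_tokens / rate


# Approximate rates quoted in the comment above (tokens/second).
RATES = {
    "Llama 2 70B, 24 vCPU host": 0.15,
    "Llama 2 7B, 24 vCPU host": 1.5,
    "M1 MacBook Air": 10.0,
}

REPLY_TOKENS = 200  # hypothetical reply length, chosen only for illustration

if __name__ == "__main__":
    for name, rate in RATES.items():
        minutes = seconds_for_reply(REPLY_TOKENS, rate) / 60
        print(f"{name}: {minutes:.1f} min for {REPLY_TOKENS} tokens")
```

At ~0.15 tokens/second a 200-token reply takes around 22 minutes, versus about 20 seconds at ~10 tokens/second, which is why the 70B run is not usable interactively.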