by qudat 1041 days ago
I just wanted to call out that some of these quick-to-start tools are CPU-only (e.g. ollama), which is great to play with, but if you want to use your GPU you've gotta go to llama.cpp.

Further, 70B support in llama.cpp is still under development as far as I know.

3 comments

Indeed, many tools in this space don't maximize resource utilization at runtime. Even the quantized models are massive resource hogs, so you need all the performance you can get!

Ollama on macOS will use both the GPU and the Accelerate framework. It's built with the (amazing) llama.cpp project.

To run the 70B model you can try:

  ollama run llama2:70b

Note you'll most likely need a Mac with 64GB of shared memory, and there's still a bit of work to do to make sure 70B works like a charm.
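
For a rough sense of where the 64GB figure comes from, here is a back-of-the-envelope sketch. It assumes a 4-bit quantization at about 4.5 bits per weight (my assumption, roughly what a q4_0-style format costs once per-block scales are included) and round parameter counts of 7e9 and 70e9. It only counts the weights, not the KV cache or runtime overhead. The function name is made up for illustration.

```python
def quantized_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes (weights only)."""
    return n_params * bits_per_weight / 8


GB = 1e9  # decimal gigabytes, to keep the arithmetic simple

# Assumed values: round parameter counts and ~4.5 bits per weight.
for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    size = quantized_weight_bytes(n_params, bits_per_weight=4.5)
    print(f"{name}: ~{size / GB:.1f} GB of weights")
```

Under those assumptions the 70B weights alone come to roughly 39 GB, which would not leave room for anything else on a 32GB machine and is why 64GB is the comfortable target.
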
I am using ollama today on a MacBook Pro M1 Max with 64GB. Using a llama2 70b model, I am getting about 7 tokens/second with the onboard GPU. Before ollama used the GPU, that was much slower. For comparison, the 7b model gets me closer to 55 tokens/second. There is no way it could achieve those numbers without the GPU.
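
As a sanity check on those numbers, here is a crude ceiling estimate. It assumes single-stream decoding is memory-bandwidth bound (every weight is read once per generated token), a peak bandwidth of 400 GB/s for the M1 Max, and about 4.5 bits per weight. All three are assumptions on my part, and the helper name is hypothetical. Real throughput will land below the ceiling.

```python
def decode_tokens_per_second_bound(n_params: float,
                                   bits_per_weight: float,
                                   bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/second if each token requires reading all weights once."""
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / weight_bytes


M1_MAX_BANDWIDTH = 400e9  # assumed peak memory bandwidth in bytes/second

# Observed figures are the ones quoted in the comment above.
for name, n_params, observed in [("7B", 7e9, 55), ("70B", 70e9, 7)]:
    bound = decode_tokens_per_second_bound(n_params, 4.5, M1_MAX_BANDWIDTH)
    print(f"{name}: ceiling ~{bound:.0f} tok/s, observed ~{observed} tok/s")
```

Both reported speeds sit under the estimated ceilings (roughly 100 and 10 tokens/second) and scale about the way the model sizes do, which is consistent with the GPU path being used.
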
70B llama.cpp works now. You need the temporary `-gqa 8` flag for 70B.

You can even extend context with RoPE!
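
For anyone wondering what extending context with RoPE means in practice, here is a minimal sketch of linear position scaling, one common way to do it. This is not llama.cpp's code. The function and parameter names are mine, and the defaults (head dimension 128, frequency base 10000) are assumed values for a Llama-style model.

```python
def rope_angles(position: int,
                head_dim: int = 128,
                freq_base: float = 10000.0,
                freq_scale: float = 1.0) -> list[float]:
    """Rotation angles RoPE applies at one token position.

    freq_scale < 1 compresses positions so that a longer sequence maps
    onto the position range the model saw during training.
    """
    scaled_pos = position * freq_scale
    return [scaled_pos * freq_base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]


# With a scale of 0.5, position 4096 gets the same angles as position 2048
# would without scaling, so a 2x longer context stays in the trained range.
print(rope_angles(4096, freq_scale=0.5) == rope_angles(2048))
```

The trade-off is that neighbouring positions end up closer together in angle space, so quality can drop a bit unless the model is fine-tuned with the same scaling.
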