by jmorgan 1041 days ago
Indeed, many tools in this space don't maximize resource utilization at runtime. Even the quantized models are massive resource hogs, so you need all the performance you can get!

Ollama on macOS will use both the GPU and Apple's Accelerate framework. It's built with the (amazing) llama.cpp project.

To run the 70B model you can try:

  ollama run llama2:70b
Note that you'll most likely need a Mac with 64GB of shared memory, and there's still a bit of work to do to make sure 70B works like a charm.
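
For a rough sense of where the 64GB figure comes from, here's a back-of-the-envelope sketch. It only counts the weights and assumes a flat number of bits per weight; the 4-bit case is my assumption about the quantization, and real quantized files are somewhat larger, with the context cache and the OS needing memory on top. The function name is made up for illustration.

```python
def approx_weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes).

    Ignores per-block quantization overhead, the KV cache, and runtime
    buffers, so treat the result as a lower bound.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 70e9  # 70B parameters
    for bits in (4, 8, 16):
        gb = approx_weight_memory_gb(n_params, bits)
        print(f"{bits}-bit weights: ~{gb:.0f} GB")
```

At 4 bits the weights alone come to roughly 35 GB, which is already too much for a 32GB machine once everything else is counted, so 64GB is the comfortable tier.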