HN Mirror


by jmorgan 1089 days ago

Indeed, many tools in this space don't maximize resource utilization at runtime. Even the quantized models are massive resource hogs, so you need all the performance you can get!

Ollama on macOS will use both the GPU and the Accelerate framework. It's built with the (amazing) llama.cpp project.

To run the 70B model you can try:

  ollama run llama2:70b

Note that you'll most likely need a Mac with 64GB of shared memory, and there's still a bit of work to do to make sure 70B works like a charm.
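For a rough sense of where the 64GB figure comes from, here's a back-of-the-envelope sketch. The 4.5 bits per weight is my assumption for a typical 4-bit quantization with per-block scale overhead, and 70e9 is the nominal parameter count; the actual file size of the llama2:70b download may differ, and this ignores the KV cache and everything else running on the machine.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9        # nominal parameter count for the 70B model
BITS_PER_WEIGHT = 4.5  # assumption: 4-bit quantization plus per-block scales

# Weights only; context (KV cache) and OS overhead come on top of this.
weights_gb = estimate_weight_memory_gb(N_PARAMS, BITS_PER_WEIGHT)
fp16_gb = estimate_weight_memory_gb(N_PARAMS, 16)

print(f"quantized weights: ~{weights_gb:.1f} GB")
print(f"fp16 weights:      ~{fp16_gb:.1f} GB")
```

Under those assumptions the quantized weights alone come to roughly 39 GB, which already rules out a 32GB machine and leaves a 64GB one with some headroom for context and the rest of the system.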