

by mark_l_watson 896 days ago

I could only run 2-bit q2 mode on my 32G M2 Pro. I was a little disappointed, but I look forward to trying the new approach you linked. For now I just use Mistral’s and also a 3rd-party hosting service.

After trying the various options for running locally, I have settled on just using Ollama - really convenient and easy, and the serve APIs let me use various LLMs in several different (mostly Lisp) programming languages.
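For anyone who wants to try the serve API from a script, here is a minimal Python sketch. It assumes Ollama is listening on its default local address (`http://localhost:11434`) and that a model named `mistral` has already been pulled; the helper names `build_request` and `generate` are made up for illustration.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumption: not reconfigured).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage (needs a running server and a pulled model, both assumed here):
#   print(generate("mistral", "Say hello in one sentence."))
```

The same JSON-over-HTTP call should translate to any language with an HTTP client, which is presumably what makes it easy to use from Lisp.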

With excellent resources from Hugging Face, tool providers, etc., I hope that the user-facing interface for running LLMs is simplified even further: enter your hardware specs and get a list of available models filtered by what runs on your setup. Really, we are close to being there.
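A rough sketch of what such a filter could look like, in Python. Everything here is an assumption for illustration: the catalog entries are hypothetical, the size estimate counts only the weights (parameters times bits per weight), and the 70% usable-memory fraction is a guess, not a measured figure.

```python
from dataclasses import dataclass


@dataclass
class ModelOption:
    name: str               # hypothetical catalog name
    params_billion: float   # parameter count, in billions
    bits_per_weight: float  # nominal quantization width


def weight_gb(m: ModelOption) -> float:
    """Approximate weight size in GB: params * bits / 8 bits per byte."""
    return m.params_billion * m.bits_per_weight / 8


def runnable(options, ram_gb: float, usable_fraction: float = 0.7):
    """Keep the options whose weights fit in the assumed usable share of RAM."""
    budget_gb = ram_gb * usable_fraction
    return [m for m in options if weight_gb(m) <= budget_gb]


# Hypothetical catalog; real tools would pull this from a model hub.
CATALOG = [
    ModelOption("example-7b-q4", 7.0, 4),
    ModelOption("example-47b-q2", 47.0, 2),
    ModelOption("example-47b-q4", 47.0, 4),
]

for m in runnable(CATALOG, ram_gb=32):
    print(f"{m.name}: ~{weight_gb(m):.1f} GB of weights")
```

A real tool would also have to account for context length, KV cache, and whatever else is running on the machine, so treat the numbers as a first-pass filter only.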

Off topic: I hope I don’t sound too lazy, but I am retired (in the last 12 years before retirement I managed a deep learning team at Capital One, worked for a while at Google and three other AI companies) and I only allocate about 2 hours a day to experimenting with LLMs, so I like to be efficient with my time.

2 comments

Casteil 896 days ago

Ollama[1] + Ollama WebUI[2] is a killer combination for offline/fully local LLMs. It takes all the pain out of getting LLMs going. Both projects are rapidly adding functionality, including the recent addition of multimodal support.

[1] https://github.com/jmorganca/ollama

[2] https://github.com/ollama-webui/ollama-webui


weiran 892 days ago

You should be able to run Q3 and maybe even Q4 quants with 32GB, even on the GPU, since you can raise the max RAM allocation with: 'sudo sysctl iogpu.wired_limit_mb=12345'
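To pick a value in place of 12345, one simple approach is to take total RAM and leave some headroom for the OS. A small Python sketch of that arithmetic follows; the 8 GB reserve is only an assumed starting point, not a recommendation from Apple, and the helper names are made up.

```python
def wired_limit_mb(total_ram_gb: int, reserve_gb: int = 8) -> int:
    """Return a candidate limit in MB: total RAM minus a reserve for the OS."""
    if reserve_gb >= total_ram_gb:
        raise ValueError("reserve must be smaller than total RAM")
    return (total_ram_gb - reserve_gb) * 1024


def sysctl_command(total_ram_gb: int, reserve_gb: int = 8) -> str:
    """Format the sysctl invocation from the comment above with the computed value."""
    return f"sudo sysctl iogpu.wired_limit_mb={wired_limit_mb(total_ram_gb, reserve_gb)}"


# Print the command for a 32 GB machine, keeping the assumed 8 GB in reserve.
print(sysctl_command(32))
```

Reserving too little can starve the rest of the system, so it is probably worth raising the limit in steps.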
