by bekantan 897 days ago

> The output quality is not "ruined" at all.

That was my experience as well - the 3-bit version is pretty good.

I also tried the 2-bit version, which was disappointing.

However, there is a new 2-bit approach in the works[1] (merged yesterday) which performs surprisingly well for Mixtral 8x7B Instruct with 2.10 bits per weight (12.3 GB model size).

[1] https://github.com/ggerganov/llama.cpp/pull/4773
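As a rough sanity check on those numbers, here is a minimal sketch of the size arithmetic. It assumes Mixtral 8x7B has about 46.7B total parameters (that figure is not from this thread) and that every weight is stored at the same average bit width, which real quantized files only approximate. The function name is made up for illustration.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimate the weight file size in decimal gigabytes.

    Assumes every parameter is stored at the same average bit width;
    real quantized files mix tensor types, so treat this as an estimate.
    """
    total_bytes = n_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 1e9


# Assumption: ~46.7e9 total parameters for Mixtral 8x7B.
MIXTRAL_PARAMS = 46.7e9

size_2_10 = model_size_gb(MIXTRAL_PARAMS, 2.10)
print(f"2.10 bpw -> {size_2_10:.1f} GB")
```

Under those assumptions the estimate lands at about 12.3 GB, in line with the size quoted above.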

2 comments

mark_l_watson 896 days ago

I could only run the 2-bit q2 version on my 32 GB M2 Pro. I was a little disappointed, but I look forward to trying the new approach you linked. For now I just use Mistral’s own hosting service and a third-party one.

After trying the various options for running locally, I have settled on just using Ollama - really convenient and easy, and the serve APIs let me use various LLMs in several different (mostly Lisp) programming languages.
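For anyone who has not tried the serve API, a minimal sketch of a call from Python's standard library might look like the following. It assumes `ollama serve` is running on the default local port 11434 and that a model has already been pulled under the name `mixtral`; both are assumptions, and the helper names are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, url=OLLAMA_URL):
    """Build a POST request for Ollama's generate endpoint (non-streaming)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt):
    """Send the request and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Example (needs a running Ollama server and a pulled model):
# print(generate("mixtral", "Why is the sky blue?"))
```

Because it is plain HTTP with a JSON body, the same request is easy to reproduce from any language with an HTTP client.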

With excellent resources from Hugging Face, tool providers, etc., I hope that the user-facing interface for running LLMs is simplified even further: enter your hardware specs and get the available models filtered by what runs on your setup. Really, we are close to being there.

Off topic: I hope I don’t sound too lazy, but I am retired (in the last 12 years before retirement I managed a deep learning team at Capital One, worked for a while at Google and three other AI companies) and I only allocate about 2 hours a day to experiment with LLMs so I like to be efficient with my time.

Casteil 896 days ago

Ollama[1] + Ollama WebUI[2] is a killer combination for offline/fully local LLMs. It takes all the pain out of getting LLMs going. Both projects are rapidly adding functionality, including the recent addition of multimodal support.

[1] https://github.com/jmorganca/ollama

[2] https://github.com/ollama-webui/ollama-webui

weiran 892 days ago

You should be able to run Q3 and maybe even Q4 quants with 32GB, even on the GPU, since you can raise the maximum RAM allocation with: 'sudo sysctl iogpu.wired_limit_mb=12345'
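The 12345 there is just a placeholder. A minimal sketch for picking a value, assuming you want to leave a fixed amount of RAM for the OS and other apps (the 8 GB headroom below is an arbitrary assumption, not a recommendation, and the helper names are made up):

```python
def wired_limit_mb(total_ram_gb: float, headroom_gb: float) -> int:
    """Return a wired-memory limit in MB, leaving headroom_gb for everything else."""
    if headroom_gb >= total_ram_gb:
        raise ValueError("headroom must be smaller than total RAM")
    return int((total_ram_gb - headroom_gb) * 1024)


def sysctl_command(limit_mb: int) -> str:
    """Format the sysctl invocation quoted above with the chosen limit."""
    return f"sudo sysctl iogpu.wired_limit_mb={limit_mb}"


# Assumption: 32 GB machine, 8 GB left for the OS and other apps.
print(sysctl_command(wired_limit_mb(32, 8)))
```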

coder543 896 days ago

That is a very interesting discussion. Weird to me that the quantization code wasn’t required to be in the same PR. Ika is also already talking about a slightly higher 2.31bpw quantization, apparently.
