by bekantan
897 days ago
> The output quality is not "ruined" at all.

That was my experience as well - 3-bit version is pretty good. I also tried 2-bit version, which was disappointing. However, there is a new 2-bit approach in the works [1] (merged yesterday) which performs surprisingly well for Mixtral 8x7B Instruct with 2.10 bits per weight (12.3 GB model size).

[1] https://github.com/ggerganov/llama.cpp/pull/4773
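
As a rough sanity check on the 12.3 GB figure, the file size of a quantized model is approximately parameters times bits per weight divided by 8. The sketch below assumes Mixtral 8x7B has about 46.7 billion total parameters (Mistral AI's published figure, not stated in this thread) and ignores metadata overhead; `quantized_size_gb` is a name made up for illustration.

```python
# Assumption: roughly 46.7 billion total parameters for Mixtral 8x7B.
# This number does not come from the thread.
TOTAL_PARAMS = 46.7e9


def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in decimal gigabytes, ignoring file metadata."""
    return n_params * bits_per_weight / 8 / 1e9


print(f"{quantized_size_gb(TOTAL_PARAMS, 2.10):.1f} GB")  # prints "12.3 GB"
```

Under that assumption the estimate agrees with the quoted model size.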
After trying the various options for running LLMs locally, I have settled on just using Ollama: it is really convenient and easy, and its serve API lets me use various LLMs from several different (mostly Lisp) programming languages.
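
For anyone who has not tried it, here is a minimal sketch of calling the local Ollama server over HTTP, shown in Python rather than Lisp. It assumes `ollama serve` is running on its default port 11434 and that a model named `mixtral` has already been pulled; the helper names `build_generate_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama is serving locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    # "stream": False asks for one JSON object instead of a chunked stream.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


# Example (needs a running server and a pulled model):
# print(generate("mixtral", "Why is the sky blue?"))
```

Because it is plain JSON over HTTP, the same request is easy to reproduce from any language with an HTTP client.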
With excellent resources from Hugging Face, tool providers, etc., I hope that the user-facing interface for running LLMs is simplified even further: enter your hardware specs and get the available models filtered down to what runs on your setup. Really, we are close to being there.
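
The core of such a filter could be quite small. The sketch below is only an illustration of the idea, not how any existing tool works: `ModelEntry`, `models_that_fit`, and the 2 GB headroom for context and runtime overhead are all assumptions, and the second catalog entry is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str
    size_gb: float  # size of the quantized model file


def models_that_fit(catalog, memory_gb: float, headroom_gb: float = 2.0):
    """Return names of models whose file fits in memory, largest first.

    headroom_gb is an assumed allowance for context and runtime overhead.
    """
    budget = memory_gb - headroom_gb
    ranked = sorted(catalog, key=lambda m: m.size_gb, reverse=True)
    return [m.name for m in ranked if m.size_gb <= budget]


catalog = [
    # Size taken from the 12.3 GB figure quoted earlier in this thread.
    ModelEntry("mixtral-8x7b-instruct-2bit", 12.3),
    # Placeholder entry with an invented size, for illustration only.
    ModelEntry("hypothetical-large-model", 40.0),
]

print(models_that_fit(catalog, memory_gb=16.0))
```

A real version would need per-model memory estimates that account for context length and GPU versus CPU memory, which is where most of the work would be.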
Off topic: I hope I don’t sound too lazy, but I am retired (in the last 12 years before retirement I managed a deep learning team at Capital One, worked for a while at Google and three other AI companies) and I only allocate about 2 hours a day to experiment with LLMs so I like to be efficient with my time.