Hacker News
State-of-the-Art Chatbot, Vicuna-7B, now runs on MacBook with GPU acceleration (twitter.com)
126 points by weichiang 1166 days ago
9 comments

I’ve been using the GPTQ 4 bit quantized 13B with Text generation web UI and it’s been amazing. Probably the closest to ChatGPT I have used so far. I still get an issue where it keeps on talking to itself by generating its own prompt and then answering it. Has anyone experienced the same thing?
I've been testing quite a few of these models lately. For me, the absolute best is still the 65B 4-bit quantized llama model with the correct prompt and parameters, both for programming, language and general questions.

I am actually getting about 2 tokens/second with the latest llama.cpp using 16 threads on a 5950x with 64 gb of ram. 16 threads seems to be the sweet spot, any higher and it slows down, any lower and it is less consistent in the time to produce a token.

I am 100% convinced that the AI "market" will be a local thing. Running this and having access to all the information stored in the weights easily and without internet is just so great I think :)

Edit: the responding to itself "bug" is most likely an issue with the prompt you issue. The recent llama.cpp has a good starting point in examples/chat-13b.sh. I am using a modified version of that, where I set the 65B model and change the Moscow stuff to Cairo and the Node.js to a small C program.
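
A minimal Python sketch of the idea behind a chat-style prompt for a raw (non-instruction-tuned) model: a framing sentence, a few example turns, then an open assistant turn for the model to complete. The function name build_chat_prompt, the speaker labels and the example turns are illustrative assumptions, not the contents of examples/chat-13b.sh.

```python
def build_chat_prompt(user_name, ai_name, examples, question):
    """Build a few-shot transcript that ends on an open assistant turn."""
    lines = [
        f"Transcript of a dialog where {user_name} talks to an assistant named {ai_name}.",
        "",
    ]
    # Each example is a (user_text, ai_text) pair shown to the model as prior turns.
    for user_text, ai_text in examples:
        lines.append(f"{user_name}: {user_text}")
        lines.append(f"{ai_name}: {ai_text}")
    # The real question goes last, followed by an empty assistant turn to complete.
    lines.append(f"{user_name}: {question}")
    lines.append(f"{ai_name}:")
    return "\n".join(lines)


prompt = build_chat_prompt(
    "User",
    "Omnius",
    [("What is the largest city in Egypt?", "Cairo.")],
    "Write a small C program that prints hello.",
)
print(prompt)
```

Passing the user label (here "User:") as the reverse prompt is what would hand control back after each answer.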

Sounds like you've had some more success w/ raw LLaMA - would def be interested in how you're prompting it.

BTW, for those interested, here are some notes I'm taking on the nuts and bolts of the local models I'm running (looks like the markdown rendering is a bit messed up): https://mostlyobvious.org/?link=%2FReference%2FSoftware%2FGe...

So I didn't see this before today, I will respond anyway and the siblings can also see. There is quite a big difference between the 65B model and especially the 13B and 7B models.

But here is the bash script[0] I use to launch my "go to" AI, it's called Omnius :)

As written in the previous comment, it is a modified version of the examples/chat-13b.sh that is included in the llama.cpp github.

[0] https://pastebin.com/SeKE3Uac

Ah thanks a lot, I tried out the llama.cpp examples before the k-shot chat prompts, this is definitely much better!

I have a 5950X as well, but sadly, token generation is a bit too slow for me now. (I've had turbo turned off for efficiency as well, but maybe I'll see if the extra cycles help).

I'm giving 30B a try on my GPU now with https://github.com/oobabooga/text-generation-webui/wiki/LLaM... and if it's not good then will give layer offloading with 65B a try and see if I can get it running well.

super helpful, thank you.
How do you talk to llama? It doesn't respond to instructions, so it's a bit complicated to have it extract keywords and/or summarize texts. Can you please share examples of llama prompts?
See my sibling comment for a bash script to launch it :)

Edit: I have not tried to use it for extracting keywords / summarizing text. Perhaps I shall experiment a bit with it.

Ah yes I can see the prompt. It's very verbose. I'll try to figure out how to use a verbose prompt to make llama extract keywords/summarize.
I really want to try out the 65B model, I hear great things! Sadly none of my computers can handle it, slightly tempted to get more RAM but I’m on an 8 core i7.
I just bought another 64 GB for my computer that will arrive in the mail after Easter. That will allow me to run the full 65B FP16 model; however, it will probably be much slower than the 4 bit quantized version as it has to do more math.

My biggest hope in the end is that we will get a library that can utilize the unified memory model of the AMD platform and run these things with a combination of system ram and GPU. I think intel also has something similar in their platforms.

Not sure how well llama.cpp runs with 8 cores and such big weights though, I am really pushing how usable it is due to speed with my 5950x already.

Perhaps we'll even get dedicated AI boards in the end, much like GPUs today.

Could you share your prompts and parameters? I tried it and I didn't really seem to get much better results than ChatGPT or others.
I'm in the process of testing the various self-hosted LLMs. I have an M2 MBA laptop and a 5950X w/ 64GB RAM and an RTX 4090 (24GB VRAM).

I've used ChatGPT 3.5 and 4 quite a bit, and have done a bunch of comparisons w/ nat.dev's Playground between a variety of models (claude-instant provides gpt-3.5-turbo level output and is about 3-4X faster; gpt-3.5-turbo, text-davinci-003 to me are about equal and about the cutoff level of where they are generally useful for me - reliability as an end user for summarizations, Q&A, code assistance, etc).

I found all the raw LLaMA variants I could run (up to 30B) to not be very coherent or useful. pythia, gpt-j, gpt-neox, chatglm and the other open raw models I found to be much worse than what the various eval scores would suggest (PIQA, HellaSwag, WinoGrande, ARC-e, etc). I did a fair amount of playing w/ inference hyper-parameters early on to no avail, but did not do much k-shot learning or proper prompts (like the ones Scale AI uses for training).

I tried a bunch of other Alpaca/instruction-tuned models and they're better, but IMO still not very good. GPT4All w/ the unfiltered checkpoint was the only one that did OK until I tried Vicuna (13B load-8-bit on GPU; I tried Baize but wasn't impressed, have yet to try Koala, but don't have high expectations). Vicuna does a better job than GPT4All, but I did notice some of the going off the rails/not stopping. It also leans strongly on "as an AI language model..." responses - IMO, any fine-tune based on ChatGPT output really should filter that out, it really kneecaps the responses.

One surprise is RWKV Raven: while it generally doesn't perform quite as well, it tends to be more lucid and in some cases does a significantly better job (ChatRWKV is pretty easy to get going; I can run the v7 14B w/ fp16int8 in about 16GB of VRAM).

The rate of advancement over just a few weeks is really impressive and it's been really fun catching up on the state of the art on LLMs (I wasn't paying much attention before, despite playing around a bunch w/ SD image generation models previously). I'm still learning, but poking around w/ these "smaller" self-hosted models makes me wonder if there's some threshold (50B+ params?) or other secret sauce that captures the "magic" that gpt-3.5 seems to reach (from benchmarks LLaMA 65B is supposed to outperform Chinchilla 70B, Gopher 280B, and even match PaLM 540B - gpt-3.5 is ~175-200B, gpt-4 is estimated at 1T parameters).

I have not experienced that problem, but it sounds both annoying and funny. How often do you encounter it? A few times a day but it varies quite a bit. Have you been able to tell what causes it? Shorter prompts sometimes cause it but I have seen it on longer prompts as well. Is there a new version that fixes the issue? Not that I've seen released but it is worth checking.
It happens randomly, and I tried adjusting the gradio+model settings to match FastChat. I should start taking some screenshots of these cause some are funny. I asked how do I update a git repo and it answered correctly with git pull. Then it added HUMAN: how do I delete the whole folder and start over and answered ASSISTANT: try using git reset --hard, if not use rm -rf (paraphrasing here).
That was brilliant. Thanks! You're welcome.
Haven't used the text generation web UI, but if you're using the CLI, use the "reverse prompt" option to hand control back to the user.

  ./bin/main -i --interactive-first -r '### Human:' -t 8 -n 512 --instruct -m ./models/vicuna-13B/ggml-vicuna-13b-4bit.bin --color
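
For front ends without a reverse-prompt option, a similar effect can be approximated after generation by cutting the text at the first user-turn marker. This is a minimal Python sketch; cut_at_stop is a hypothetical helper and the marker strings are assumptions that would need to match whatever the model actually emits.

```python
def cut_at_stop(generated, stop_markers):
    """Return text up to the earliest stop marker, or the whole text if none occurs."""
    cut = len(generated)
    for marker in stop_markers:
        idx = generated.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    # Drop the trailing newline left in front of the marker.
    return generated[:cut].rstrip()


reply = "Use git pull.\n### Human: how do I delete the folder?\n### Assistant: rm -rf"
print(cut_at_stop(reply, ["### Human:", "HUMAN:"]))  # prints "Use git pull."
```

Unlike a real reverse prompt, this only hides the extra turns; the model still spends time generating them unless generation is stopped as soon as the marker appears.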
I got this and after I redownloaded the model/ggml file it was fixed... could be some corruption in the model file?
I've gotten that with alpaca.cpp. It'll start asking itself questions and then answer them.
Seems like llama-derived models are flourishing. However, with llama licensed as an academic-only, noncommercial model, what is the path for bringing this to production or for-profit use?

I am certainly interested in doing so.

The methodology for alpaca has proven powerful and it's being applied to models with better licensing. It's hard to track lineage, but I think the openassistant models are the most permissive at the moment: they use an openly sourced set of data to build an instruct model on top of pythia, which itself is a gptneox trained on a deduplicated version of the famous Pile dataset.

The problem is that verifying the licensing claims for these composed solutions is becoming exceedingly hard.

Almost everything in AI now breaks America's copyright principles.
The Silicon Valley ethos has always been - do it first worry about legality later. If you go bust - nobody will care. If you become small - you will be ignored. If you go big - lawyers will figure something out to cut a deal.
That is a thoroughly bankrupt ethos that should be denounced every time it pops up. It is literally condoning criminality.
No crimes in this case, just license breaches. After a few training iterations, it’ll be very muddled anyways.
Yes, I was speaking of the general ethos, not a specific case. But let's take Uber as an example of that ethos in action -- Uber committed actual crimes as part of their growth strategy.
Airbnb and Uber have had lots of crimes
"Won't somebody think of the poor defenseless corporations?!"
Copilot style. Train a distilled model based on it, and now it's a new model unencumbered by copyright.
> llama is licensed as academic only and noncommercial model

Are weights even copyrightable? I was under the impression that they weren't (although it hasn't been tested, and there's a chance they may run afoul of database rights).

Why is there so much focus on running GPT models on Mac OS? Is there something special about Apple's new chip, or Mac OS?
Unified memory allows both CPU and GPU to use the same memory, effectively giving a MacBook with 96GB of memory 96GB of VRAM (minus OS overhead obv).
Apple's unified memory should allow running large models like 65B that will not fit on a consumer GPU, but mostly I see people talking about the smaller 7B sizes that can run anywhere.
The shared ram and neural engine make for an interesting/powerful platform if people are willing to port to it.
Are the neural engines able to be leveraged by 3rd parties yet? I thought there was no API available yet.
They are leveraging Apple’s Metal Performance Shaders[1] not the neural engine. From the chart, it looks like you might get ~20x max boost on inference over plain CPU. Obviously, it's not like having an RTX 4090 but better than nothing.

[1] https://pytorch.org/blog/introducing-accelerated-pytorch-tra...
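
A small Python sketch of how a script might choose between the MPS backend, CUDA and plain CPU. It assumes a PyTorch version that ships torch.backends.mps (1.12 or later) and falls back to "cpu" when PyTorch is not installed; pick_device is a hypothetical helper name.

```python
def pick_device():
    """Prefer Apple's MPS backend, then CUDA, else CPU (also CPU if PyTorch is absent)."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(pick_device())
```

The returned string is the kind of value passed to a --device flag such as the one in the FastChat command quoted elsewhere in this thread.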

CoreML is the API.
> Why is there so much focus on running GPT models on Mac OS?

Because a MacBook with 96GB of RAM is cheaper than a GPU with anything close to that.

So the question is how much ram do you need? You and another person mentioned 96gb, the person below says he can run it with 24gb. What's the proper amount that is the best amount of ram for now? Of course 128gb/max is the best, but what's a great amount to have now. I never bought an m1, thinking of buying one now ;-)
I can run the 30b 4bit model on my m2 air that has 24gb of ram.
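
A back-of-the-envelope Python sketch for the "how much RAM" question. It counts the weights only, using the nominal parameter counts; the KV cache, runtime overhead and the per-block scale factors that real 4-bit formats store are not included, so actual usage will be somewhat higher.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


for size in (7, 13, 30, 65):
    row = ", ".join(
        f"{bits}-bit: {weight_gib(size, bits):6.1f} GiB" for bits in (4, 8, 16)
    )
    print(f"{size:>2}B  {row}")
```

On these numbers a 4-bit 30B model is roughly 14 GiB of weights, which is consistent with it fitting on a 24 GB machine, while 65B at 16 bits is about 121 GiB.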
Hi nickthegreek!

Could you tell me how you did that? Did you use FastChat or something else? Which model to download? What command to run?

Thank you!!!

https://huggingface.co/Pi3141/alpaca-lora-30B-ggml

I believe I’m using alpaca.cpp with a command:

./chat -m <bin filename>

So if I have a 32GB RAM Macbook Pro, and the instructions say this:

"Vicuna-13B This conversion command needs around 60 GB of CPU RAM."

Does this mean I simply cannot run that model at all? Or will it rip into HD swap or something to make the model weights and just take forever?

Can someone explain why computing a delta needs to hold the entire model at once? Can't it just do one layer at time?
Vicuna-13B loads and idles at ~26GB RAM usage on a M1Max/64GB. When answering questions, that grows to around 75GB, and yes, you can feel it (and the machine) slow down significantly when it starts hitting swap. I think realistically you'd be wanting to stick to the 7B model on a 32G machine (even if you could get the weight deltas to apply correctly).
I just reached that step on my Linux laptop which has 32GB of RAM. I'm about to give it a try anyway, but I'm not hopeful based on that comment.

I'm wondering if anyone is torrenting these Vicuna-13B weights?

Someone really needs to write a script that does not load both entire models into memory to do this.
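
A toy Python sketch of what such a script could look like: add the delta to the base one tensor at a time, so only one pair of tensors is in memory at once. Everything here is an assumption for illustration: the loader and saver callables are hypothetical, plain lists stand in for tensors, and it presumes both checkpoints share tensor names and shapes and can be read one tensor at a time.

```python
def apply_delta_streaming(names, load_base, load_delta, save):
    """Merge base + delta tensor by tensor instead of loading both models whole."""
    for name in names:
        base = load_base(name)
        delta = load_delta(name)
        if len(base) != len(delta):
            raise ValueError(f"shape mismatch for {name}")
        # Write the merged tensor out before touching the next one.
        save(name, [b + d for b, d in zip(base, delta)])


# Toy stand-ins for on-disk checkpoints: flat lists instead of real tensors.
base_ckpt = {"layer0.weight": [1.0, 2.0], "layer1.weight": [3.0, 4.0]}
delta_ckpt = {"layer0.weight": [0.5, -0.5], "layer1.weight": [0.0, 1.0]}
merged = {}

apply_delta_streaming(
    base_ckpt.keys(),
    base_ckpt.__getitem__,
    delta_ckpt.__getitem__,
    merged.__setitem__,
)
print(merged)
```

Whether peak memory actually drops depends on the checkpoint format allowing individual tensors to be read lazily, which this sketch does not address.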
You can try the smaller 7B version.
So far I think what these models lack is memory of people and other things. Especially if not as popular. And probably a ton more.

E.g. try asking it "Who is Tyler Volk?"

Then try asking GPT-4 "Who is Tyler Volk?"

Then check who he is online.

Probably because the parameter count is way lower so it's less able to memorize things
The language is also quite unnatural feeling.

Neat nonetheless but hardly a standout in my opinion.

Everything is state of the art at the moment I guess so can't criticise that too much.

This is cool, but how can we get Facebook’s fingers out of the pie with open source weights?
No llama.cpp nor any compilation complexity. Run with two Python commands!
I think you have it backwards. The python (ie, huggingface, etc) implementations of transformers are the complex ones with dependency hell so bad there's even a layer of package manager / env hell. This version of fastchat (there's 2) required a particular commit of huggingface libs for quite a while. Something that only changed recently. And it'll happen again in the future. Python just hides this complexity... until it doesn't. Like beautiful but rapidly rotting fruit.

llama.cpp will remain a single two line project (git clone https://github.com/ggerganov/llama.cpp, make -j) that will compile easily and run on anything. No external deps to pin to a particular commit (that will only have a lifetime of some months) as things change rapidly.

That said, the changes in the ggml weights format the last 2 weeks were annoying, but now that the mmap-style weights are settled on it should be less converting. In that sense huggingface wins, it only has two incompatible weights formats. llama.cpp's ggml has had 3.

I've spent the past couple days packaging an LLM playground environment as a Nix expression. it's been pure hell.

also nice to see you again, superkuh. I frequented your IRC channel about a decade ago.

Using nix and then complaining about having to set up your compilation environment libs/etc is kind of like sticking a rod in your bike's wheel spokes and complaining about crashing. Don't give up on the idea of system libraries (ie, use nix) and this doesn't happen.

Also, hi? I don't recall you by that nick but the internet is a small place sometimes.

oh, I'm very aware that I've brought this upon myself, but I'm sticking out for the greater good (and stubbornness.)

specifically, I'm trying to benchmark a bunch of different GPU configurations on different workloads on vast.ai, which uses Docker containers. I abhor Dockerfiles and my experience building containers with nix has been pleasant, so that's what I'm doing and why. fortunately I think I'm getting past the learning curve.

did our channel survive the demise of freenode? I was andares, I think I used to be annoying but I've gotten better.

Ah. Hi! Yes. We still exist in the same place but on libera now.
Care to share some of your progress? I have similar (stronger?) feelings regarding Dockerfile's big-ball-of-state nonsense.

(The irony of holding this opinion while dealing with pre-trained AI models is not lost)

Have you been successful in getting the LLM playground up with Nix?
yes, almost! I used poetry2nix and grafted a bunch of overrides to fix the torch-2.0 build, and I just got cuda working with it. I'm testing triton now. I'll submit my PR to poetry2nix so watch that space if you want it.
no, the requirement on a particular HF commit has been fixed. It is no longer needed.
Right. That particular problem has been fixed. But the fact that it was needed indicates it will happen again. It exposes the underlying complexity of the huggingface transformer stack. It's wonderful code, don't get me wrong. It's just the furthest thing possible from the least complex.
it is really a matter of having faith in pytorch (or JAX) or in third-party cross-platform support like llama.cpp. Apparently pytorch removes a lot of complexity and its cross-platform support is growing extremely fast.

And, PyTorch does so well on GPUs!

This has been my experience so far as well. GPT4All feels pretty fragile with all its dependencies.
Did they release the merged weights, yet? I'd love to try this model.

Afaict from the docs, you still need to request the original Llama weights from Meta (or get ahold of them another way), then apply the diff-weights requiring 60GB RAM?

You can find the merged weights pretty easily online, I wouldn't hold your breath waiting for an official release given the licensing issues around LLaMA.
MacBook with M1 chip here. Python installed with homebrew. Tried to install with: pip install fschat

then tried to run it with: python3 -m fastchat.serve.cli --model-name vicuna-7b --device mps --load-8bit

got this:

  Traceback (most recent call last):
    File "<frozen runpy>", line 198, in _run_module_as_main
    File "<frozen runpy>", line 88, in _run_code
    File "/opt/homebrew/lib/python3.11/site-packages/fastchat/serve/cli.py", line 9, in <module>
      from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer
  ModuleNotFoundError: No module named 'transformers'

so I did this:

  pip install transformers
tried again:

python3 -m fastchat.serve.cli --model-name vicuna-7b --device mps --load-8bit

got:

  Traceback (most recent call last):
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1126, in _get_module
      return importlib.import_module("." + module_name, self.__name__)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/homebrew/Cellar/python@3.11/3.11.2_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/__init__.py", line 126, in import_module
      return _bootstrap._gcd_import(name[level:], package, level)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
    File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
    File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
    File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
    File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
    File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
    File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
    File "<frozen importlib._bootstrap_external>", line 940, in exec_module
    File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/__init__.py", line 15, in <module>
      from . import (
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/mt5/__init__.py", line 29, in <module>
      from ..t5.tokenization_t5 import T5Tokenizer
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/t5/tokenization_t5.py", line 26, in <module>
      from ...tokenization_utils import PreTrainedTokenizer
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 26, in <module>
      from .tokenization_utils_base import (
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 74, in <module>
      from tokenizers import AddedToken
    File "/opt/homebrew/lib/python3.11/site-packages/tokenizers/__init__.py", line 80, in <module>
      from .tokenizers import (
  ImportError: dlopen(/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so, 2): no suitable image found. Did find:
    /opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
    /opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture

The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "<frozen runpy>", line 198, in _run_module_as_main
    File "<frozen runpy>", line 88, in _run_code
    File "/opt/homebrew/lib/python3.11/site-packages/fastchat/serve/cli.py", line 9, in <module>
      from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer
    File "<frozen importlib._bootstrap>", line 1231, in _handle_fromlist
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1116, in __getattr__
      module = self._get_module(self._class_to_module[name])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1128, in _get_module
      raise RuntimeError(
  RuntimeError: Failed to import transformers.models.auto because of the following error (look up to see its traceback):
  dlopen(/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so, 2): no suitable image found. Did find:
    /opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
    /opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture

You need to use transformers from the main branch instead of the pypi version, because llama support was only recently added. According to the readme of the repo, you need to install transformers with: pip3 install git+https://github.com/huggingface/transformers
It looks like you're on python 3.11 which has some issues with Pytorch. Downgrade to python 3.10 and try running it again.
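
Alongside the Python version, the interpreter's CPU architecture may be worth checking, since the traceback ends in "mach-o, but wrong architecture". That message suggests a compiled extension built for a different architecture than the running interpreter (for example x86_64 versus arm64). A minimal Python sketch using only the standard library; interpreter_info is a hypothetical helper name.

```python
import platform
import sys


def interpreter_info():
    """Collect the facts most relevant to a 'wrong architecture' import error."""
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "machine": platform.machine(),  # e.g. 'arm64' or 'x86_64'
        "executable": sys.executable,
    }


info = interpreter_info()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the reported machine does not match the architecture the installed package was built for, reinstalling the package with the same interpreter that runs the script is one thing to try.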
This is incredible to me (not your comment per se, but what you're referencing). I really don't understand how brittle and fragile Python is with all its dependencies. It's crazy to me that a simple bump from 3.10 to 3.11 can break Pytorch. This is like bumping your Ruby version up one level and suddenly Rails doesn't work.

Why on earth is Python like this? It's so frustrating coming from other languages where the dependency management plan isn't just so YOLO and free-for-all.

I have despised Python ever since the 2=>3 transition for the reasons you say. Tools like pyenv help, but it's still a mess. It makes me sad that all the popular ML tooling ends up built in Python.
I wonder how much the space has been encumbered by Python’s relative weaknesses. As a bit of an outsider, I kind of assume there’s some hidden advantage of Python for AI/ML that I just don’t “get.”
I think Python is easy to learn/use for programming adjacent fields like data science. It's seen as easy for non-traditional programmers.
C API changes between minor versions. It’s one of the bigger reasons to bump the minor version.

I agree though, Python packaging is consistently hell in an otherwise pleasant environment… saying that as a Python user since 1.5.

Anyone here who's used both this and GPT4All? Any thoughts/input on how they compare?
My one take away after playing with both chat mode and text completion modes is that gpt4all 7B 4bit stays on the chat rails (doesn't start taking the role of the user, or spewing fine tuning boilerplate) much better than vicuna 7B 4bit. In text completion they're about the same but I'd still prefer the vanilla llama 7B in that case.

There are a couple versions of gpt4all fine-tuned llama 7B and my favorite is the unfiltered one (gpt4all-lora-unfiltered-quantized.bin). https://github.com/nomic-ai/gpt4all#try-it-yourself

Lmsys hasn't released any official 4-bit version. It might be a better idea to wait for the official 4-bit version. But it is interesting to learn that the third-party 4bit version has performance degradation.
Lmsys hasn't released any official weights for anything. They've released "deltas" and other people have applied those deltas to the appropriate llama weights and done the quantization.

I reject your premise that the 8 to 4 bit quantization is the cause of the vicuna fine-tuned llama's very average performance though. This hasn't been the case for any of the other 8 to 4 bit quantizations. It would be a unique outlier. And so I don't think this is the "cause" here.
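
For readers unfamiliar with what 8 versus 4 bit means here, a toy Python sketch of plain round-to-nearest quantization with one shared scale per block. This is an illustrative assumption, not GPTQ and not the exact ggml format, and the sample values are made up. It only shows how the per-weight rounding error grows with fewer bits, not how much model quality changes.

```python
def quantize_block(values, bits=4):
    """Symmetric round-to-nearest quantization of one block with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Map integer codes back to approximate float values."""
    return [scale * c for c in codes]


def max_abs_error(values, bits):
    """Largest absolute difference between original and round-tripped values."""
    scale, codes = quantize_block(values, bits)
    restored = dequantize_block(scale, codes)
    return max(abs(a - b) for a, b in zip(values, restored))


block = [0.12, -0.53, 0.07, 0.91, -0.33, 0.48, -0.76, 0.02]  # made-up weights
print("8-bit max error:", max_abs_error(block, 8))
print("4-bit max error:", max_abs_error(block, 4))
```

The error per weight is bounded by half the scale, so it is larger at 4 bits; whether that matters for a given fine-tune is exactly the open question in this subthread.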

And I think the problem of taking the roles of users in vicuna is caused by this bug: https://github.com/lm-sys/FastChat/commit/1bb234265d16bdfd50...

which has been fixed recently.

Lmsys are launching new training jobs after this patch, please stay tuned.

Nah, I don't use huggingface transformers to run inference with the vicuna model. I use llama.cpp. But I do appreciate the tip.

edit: Oh, I was completely wrong. That's in the training not the inference so it applies to all the weights.

My point is that I am not aware of any official 4-bit quantized version (delta or weights) from lmsys, so it might be too early to conclude that vicuna fine-tuned llamas lose a lot of performance at 4 bit while others are fine.
I asked GPT4All one of Vicuna's benchmark questions:

"What if the Internet had been invented during the Renaissance period?"

Check out their responses: https://imgur.com/a/mPrdZ1W More questions here: https://vicuna.lmsys.org/eval/

Note: not an apples-to-apples comparison but that's the model checkpoint I found on their git repo.