I’ve been using the GPTQ 4 bit quantized 13B with Text generation web UI and it’s been amazing. Probably the closest to ChatGPT I have used so far. I still get an issue where it keeps on talking to itself by generating its own prompt and then answering it. Has anyone experienced the same thing?
I've been testing quite a few of these models lately.
For me, the absolute best is still the 65B 4-bit quantized llama model with the correct prompt and parameters, both for programming, language and general questions.
I am actually getting about 2 tokens/second with the latest llama.cpp using 16 threads on a 5950x with 64 gb of ram. 16 threads seems to be the sweet spot, any higher and it slows down, any lower and it is less consistent in the time to produce a token.
I am 100% convinced that the AI "market" will be a local thing. Running this and having access to all the information stored in the weights easily and without internet is just so great I think :)
Edit: the responding to itself "bug" is most likely an issue with the prompt you issue. The recent llama.cpp has a good starting point in the examples/chat-13b.sh
I am using a modified version of that where I set the 65B model, change the moscow stuff to cairo and the node.js to a small C program.
So I didn't see this before today, I will respond anyway and the siblings can also see.
There is quite a big difference between the 65B model and especially the 13B and 7B models.
But here is the bash script[0] I use to launch my "go to" AI, it's called Omnius :)
As written in the previous comment, it is a modified version of the examples/chat-13b.sh that is included in the llama.cpp github.
Ah thanks a lot, I tried out the llama.cpp examples before the k-shot chat prompts, this is definitely much better!
I have a 5950X as well, but sadly, token generation is a bit too slow for me now. (I've had turbo turned off for efficiency as well, but maybe I'll see if the extra cycles help).
How do you talk to llama? It doesn't respond to instructions, so it's a bit complicated to have it extract keywords and/or summarize texts. Can you please share examples of llama prompts?
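A rough sketch of what a k-shot chat prompt can look like, since raw llama only continues text rather than following instructions. This is not the content of examples/chat-13b.sh; the function name `build_chat_prompt`, the header wording and the example turns are all made up for illustration.

```python
# Hypothetical k-shot chat prompt builder; the names and example turns are
# illustrative and are not copied from llama.cpp's examples/chat-13b.sh.

def build_chat_prompt(ai_name, user_name, examples, question):
    """Return a transcript-style prompt that ends where the model should reply."""
    header = (
        f"Transcript of a dialog between {user_name} and an assistant named "
        f"{ai_name}. {ai_name} is helpful and answers precisely.\n\n"
    )
    # Each (question, answer) pair becomes one worked example in the transcript.
    shots = "".join(
        f"{user_name}: {q}\n{ai_name}: {a}\n" for q, a in examples
    )
    # Leave the final assistant turn open so the base model completes it.
    return f"{header}{shots}{user_name}: {question}\n{ai_name}:"


prompt = build_chat_prompt(
    ai_name="Omnius",
    user_name="User",
    examples=[
        ("What is the capital of Egypt?", "Cairo."),
        ("Summarize: 'The cat sat on the mat.'", "A cat sat on a mat."),
    ],
    question="Extract keywords from: 'llama.cpp runs LLaMA on CPUs.'",
)
print(prompt)
```

The idea is that the worked turns show the model the pattern, and the open final turn is what it completes. How many examples are needed probably depends on the model size.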
I really want to try out the 65B model, I hear great things! Sadly none of my computers can handle it, slightly tempted to get more RAM but I’m on an 8-core i7.
I just bought another 64 GB of RAM for my computer, which will arrive in the mail after Easter. That will allow me to run the full 65B FP16 model, however it will probably be much slower than the 4-bit quantized version as it has to do more math.
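For a back-of-the-envelope feel of the RAM numbers, here is a small sketch that estimates the size of the weights alone. It assumes exactly 65e9 parameters and ignores everything else (KV cache, activations, and the per-block scales that real 4-bit formats store), so treat the results as lower bounds. `weight_gib` is a made-up helper name.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GiB (no runtime overhead)."""
    return n_params * bits_per_weight / 8 / GIB


# Assumption: 65e9 parameters; the real parameter count differs slightly.
for bits in (16, 4):
    print(f"65B @ {bits}-bit: ~{weight_gib(65e9, bits):.0f} GiB")
```

Under those assumptions FP16 comes out at roughly 121 GiB and 4-bit at roughly 30 GiB, which is consistent with needing 128 GB for the former and 64 GB being comfortable for the latter.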
My biggest hope in the end is that we will get a library that can utilize the unified memory model of the AMD platform and run these things with a combination of system RAM and GPU. I think Intel also has something similar in their platforms.
Not sure how well llama.cpp runs with 8 cores and such big weights though, I am already pushing the limits of usable speed with my 5950x.
Perhaps we'll even get dedicated AI boards in the end, much like GPUs today.
I'm in the process of testing the various self-hosted LLMs. I have an M2 MBA laptop and a 5950X w/ 64GB RAM and an RTX 4090 (24GB VRAM).
I've used ChatGPT 3.5 and 4 quite a bit, and have done a bunch of comparisons w/ nat.dev's Playground between a variety of models (claude-instant provides gpt-3.5-turbo level output and is about 3-4X faster; gpt-3.5-turbo, text-davinci-003 to me are about equal and about the cutoff level of where they are generally useful for me - reliability as an end user for summarizations, Q&A, code assistance, etc).
I found all the raw LLaMA variants I could run (up to 30B) to not be very coherent or useful. I found pythia, gpt-j, gpt-neox, chatglm and the other open raw models to be much worse than what the various eval scores (PIQA, HellaSwag, WinoGrande, ARC-e, etc) would suggest. I did a fair amount of playing w/ inference hyper-parameters early on to no avail, but did not do much k-shot learning or proper prompts (like the ones Scale AI uses for training).
I tried a bunch of other Alpaca/instruction-tuned models and they're better, but IMO still not very good. GPT4All w/ the unfiltered checkpoint was the only one that did OK until I tried Vicuna (13B load-8-bit on GPU; I tried Baize but wasn't impressed, have yet to try Koala, but don't have high expectations). Vicuna does a better job than GPT4All, but I did notice some going off the rails/not stopping. It also leans strongly on "as an AI language model..." responses - IMO, any fine-tune based on ChatGPT output really should filter that out, it really kneecaps the responses.
One surprise is RWKV Raven: while it generally doesn't perform quite as well, it tends to be more lucid and in some cases does a significantly better job (ChatRWKV is pretty easy to get going; I can run the v7 14B w/ fp16int8 in about 16GB of VRAM).
The rate of advancement over just a few weeks is really impressive and it's been really fun catching up on the state of the art on LLMs (I wasn't paying much attention before, despite playing around a bunch w/ SD image generation models previously). I'm still learning, but poking around w/ these "smaller" self-hosted models makes me wonder if there's some threshold (50B+ params?) or other secret sauce that captures the "magic" that gpt-3.5 seems to reach (from benchmarks LLaMA 65B is supposed to outperform Chinchilla 70B, Gopher 280B, and even match PaLM 540B - gpt-3.5 is ~175-200B, gpt-4 is estimated at 1T parameters).
I have not experienced that problem, but it sounds both annoying and funny. How often do you encounter it? A few times a day but it varies quite a bit. Have you been able to tell what causes it? Shorter prompts sometimes cause it but I have seen it on longer prompts as well. Is there a new version that fixes the issue? Not that I've seen released but it is worth checking.
It happens randomly, and I tried adjusting the gradio+model settings to match FastChat. I should start taking some screenshots of these because some are funny. I asked how do I update a git repo and it answered correctly with git pull. Then it added "HUMAN: how do I delete the whole folder and start over" and answered "ASSISTANT: try using git reset --hard, if not use rm -rf" (paraphrasing here).
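One common workaround for this kind of runaway turn is to cut the output at the first role marker the model invents. A minimal sketch, assuming you have the raw generated text as a string; `truncate_at_stop` is a hypothetical helper, not part of FastChat or the web UI, and the sample text just paraphrases the exchange above.

```python
def truncate_at_stop(text, stop_strings):
    """Cut generated text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    # Drop trailing whitespace left over before the removed marker.
    return text[:cut].rstrip()


generated = (
    "Run `git pull` inside the repository.\n"
    "HUMAN: how do I delete the whole folder and start over?\n"
    "ASSISTANT: try git reset --hard"
)
reply = truncate_at_stop(generated, ["HUMAN:", "USER:"])
print(reply)
```

llama.cpp's reverse prompt option (`-r`) is similar in spirit: in interactive mode it hands control back to the user when the marker shows up, instead of letting the model keep writing both sides.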
Seems like llama-derived models are flourishing. However, with llama licensed as an academic-only, noncommercial model, what is the path for bringing this to production for for-profit purposes?
The methodology for alpaca has proven powerful and it's being applied to models with better licensing. It's hard to track lineage, but I think openassistant models are the most permissive at the moment: they use an openly sourced set of data to build an instruct model on top of pythia, which itself is a gptneox trained on a deduplicated version of the famous Pile dataset.
The problem is verifying the licensing claims for these composed solutions is becoming exceedingly hard.
The Silicon Valley ethos has always been - do it first, worry about legality later. If you go bust - nobody will care. If you stay small - you will be ignored. If you go big - lawyers will figure something out to cut a deal.
Yes, I was speaking of the general ethos, not a specific case. But let's take Uber as an example of that ethos in action -- Uber committed actual crimes as part of their growth strategy.
> llama is licensed as academic only and noncommercial model
Are weights even copyrightable? I was under the impression that they weren't (although it hasn't been tested, and there's a chance they may run afoul of database rights).
Apple's unified memory should allow running large models like 65B that will not fit on a consumer GPU, but mostly I see people talking about the smaller 7B sizes that can run anywhere.
They are leveraging Apple’s Metal Performance Shaders[1] not the neural engine. From the chart, it looks like you might get ~20x max boost on inference over plain CPU. Obviously, it's not like having RTX 4090 but better than nothing.
So the question is how much RAM do you need? You and another person mentioned 96GB, the person below says he can run it with 24GB. Of course 128GB/max is the best, but what's a sensible amount to buy right now? I never bought an M1, thinking of buying one now ;-)
Vicuna-13B loads and idles at ~26GB RAM usage on a M1Max/64GB. When answering questions, that grows to around 75GB, and yes, you can feel it (and the machine) slow down significantly when it starts hitting swap. I think realistically you'd be wanting to stick to the 7B model on a 32G machine (even if you could get the weight deltas to apply correctly).
I think you have it backwards. The python (ie, huggingface, etc) implementations of transformers are the complex ones, with dependency hell so bad there's even a layer of package manager / env hell. This version of fastchat (there are 2) required a particular commit of huggingface libs for quite a while, something that only changed recently. And it'll happen again in the future. Python just hides this complexity... until it doesn't. Like beautiful but rapidly rotting fruit.
llama.cpp will remain a single two line project (git clone https://github.com/ggerganov/llama.cpp, make -j) that will compile easily and run on anything. No external deps to pin to a particular commit (that will only have a lifetime of some months) as things change rapidly.
That said, the changes in the ggml weights format the last 2 weeks were annoying, but now that the mmap-style weights are settled on it should be less converting. In that sense huggingface wins, it only has two incompatible weights formats. llama.cpp's ggml has had 3.
Using nix and then complaining about having to set up your compilation environment libs/etc is kind of like sticking a rod in your bike's wheel spokes and complaining about crashing. Don't give up on the idea of system libraries (ie, use nix) and this doesn't happen.
Also, hi? I don't recall you by that nick but the internet is a small place sometimes.
oh, I'm very aware that I've brought this upon myself, but I'm sticking out for the greater good (and stubbornness.)
specifically, I'm trying to benchmark a bunch of different GPU configurations on different workloads on vast.ai, which uses Docker containers. I abhor Dockerfiles and my experience building containers with nix has been pleasant, so that's what I'm doing and why. fortunately I think I'm getting past the learning curve.
did our channel survive the demise of freenode? I was andares, I think I used to be annoying but I've gotten better.
yes, almost! I used poetry2nix and grafted a bunch of overrides to fix the torch-2.0 build, and I just got cuda working with it. I'm testing triton now. I'll submit my PR to poetry2nix so watch that space if you want it.
Right. That particular problem has been fixed. But the fact that it was needed indicates it will happen again. It exposes the underlying complexity of the huggingface transformer stack. It's wonderful code, don't get me wrong. It's just the furthest thing possible from the least complex.
it is really a matter of having faith in pytorch (or JAX) or in third-party cross-platform efforts like llama.cpp. Apparently pytorch hides a lot of complexity and its cross-platform support is growing much faster.
Did they release the merged weights, yet? I'd love to try this model.
Afaict from the docs, you still need to request the original Llama weights from Meta (or get ahold of them another way), then apply the diff-weights requiring 60GB RAM?
You can find the merged weights pretty easily online, I wouldn't hold your breath waiting for an official release given the licensing issues around LLaMA.
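For anyone wondering what "apply the diff-weights" amounts to, here is a toy sketch. It assumes the delta is a plain elementwise difference (merged = base + delta) and uses Python lists in place of real tensors; `apply_delta` here is a stand-in written for illustration, not the FastChat tool, which works on full checkpoints and is where the large RAM requirement comes from.

```python
def apply_delta(base, delta):
    """Return merged weights: each tensor is base + delta, elementwise."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same tensor names")
    merged = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Toy stand-ins for checkpoints: one "tensor" with three values each.
base = {"layer0.weight": [1.0, 2.0, 3.0]}
delta = {"layer0.weight": [0.5, -1.0, 0.25]}
print(apply_delta(base, delta))
```

Since the delta is useless without the original weights, releasing only deltas sidesteps redistributing LLaMA itself.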
MacBook with M1 chip here. Python installed with homebrew.
tried to install with:
pip install fschat
then tried to run it with:
python3 -m fastchat.serve.cli --model-name vicuna-7b --device mps --load-8bit
got this:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/opt/homebrew/lib/python3.11/site-packages/fastchat/serve/cli.py", line 9, in <module>
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer
ModuleNotFoundError: No module named 'transformers'
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1126, in _get_module
return importlib.import_module("." + module_name, self.__name__)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.2_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
File "<frozen importlib._bootstrap>", line 1128, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 940, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/__init__.py", line 15, in <module>
from . import (
File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/mt5/__init__.py", line 29, in <module>
from ..t5.tokenization_t5 import T5Tokenizer
File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/t5/tokenization_t5.py", line 26, in <module>
from ...tokenization_utils import PreTrainedTokenizer
File "/opt/homebrew/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 26, in <module>
from .tokenization_utils_base import (
File "/opt/homebrew/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 74, in <module>
from tokenizers import AddedToken
File "/opt/homebrew/lib/python3.11/site-packages/tokenizers/__init__.py", line 80, in <module>
from .tokenizers import (
ImportError: dlopen(/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so, 2): no suitable image found. Did find:
/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/opt/homebrew/lib/python3.11/site-packages/fastchat/serve/cli.py", line 9, in <module>
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer
File "<frozen importlib._bootstrap>", line 1231, in _handle_fromlist
File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1116, in __getattr__
module = self._get_module(self._class_to_module[name])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1128, in _get_module
raise RuntimeError(
RuntimeError: Failed to import transformers.models.auto because of the following error (look up to see its traceback):
dlopen(/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so, 2): no suitable image found. Did find:
/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
/opt/homebrew/lib/python3.11/site-packages/tokenizers/tokenizers.cpython-311-darwin.so: mach-o, but wrong architecture
You need to use the transformers from the main branch instead of the pypi version, because the llama support is recently added. According to the readme of the repo, you need to install transformers with: pip3 install git+https://github.com/huggingface/transformers
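Separately, the "mach-o, but wrong architecture" lines in the second traceback usually suggest that the Python interpreter and the compiled tokenizers library were built for different CPU architectures (for example an x86_64 build on an arm64 machine). That is a guess from the error text, not a confirmed diagnosis. A quick check of what the interpreter itself reports, using only the standard library (`interpreter_info` is a made-up helper name):

```python
import platform
import sys


def interpreter_info():
    """Collect the facts that matter for a 'wrong architecture' dlopen error."""
    return {
        "executable": sys.executable,           # which python3 is actually running
        "python": platform.python_version(),
        "machine": platform.machine(),          # e.g. 'arm64' or 'x86_64'
        "system": platform.system(),
    }


info = interpreter_info()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `machine` does not match the architecture the installed wheel was built for, reinstalling the package with the same interpreter that runs fastchat (`python3 -m pip install ...`) is one thing to try.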
This is incredible to me (not your comment per se, but what you're referencing). I really don't understand how brittle and fragile Python is with all its dependencies. It's crazy to me that a simple bump from 3.10 to 3.11 can break Pytorch. This is like bumping your Ruby version up one level and suddenly Rails doesn't work.
Why on earth is Python like this? It's so frustrating coming from other languages where the dependency management plan isn't just so YOLO and free-for-all.
I have despised Python ever since the 2=>3 transition for the reasons you say. Tools like pyenv help, but it's still a mess. It makes me sad that all the popular ML tooling ends up built in Python.
I wonder how much the space has been encumbered by Python’s relative weaknesses. As a bit of an outsider, I kind of assume there’s some hidden advantage of Python for AI/ML that I just don’t “get.”
My one take away after playing with both chat mode and text completion modes is that gpt4all 7B 4bit stays on the chat rails (doesn't start taking the role of the user, or spewing fine tuning boilerplate) much better than vicuna 7B 4bit. In text completion they're about the same but I'd still prefer the vanilla llama 7B in that case.
Lmsys hasn't released any official 4-bit version. It might be a better idea to wait for the official 4-bit version. But it is interesting to learn that the third-party 4-bit version shows performance degradation.
Lmsys hasn't released any official weights for anything. They've released "deltas" and other people have applied those deltas to the appropriate llama weights and done the quantization.
I reject your premise that the 8 to 4 bit quantization is the cause of the vicuna fine-tuned llama's very average performance though. This hasn't been the case for any of the other 8 to 4 bit quantizations. It would be a unique outlier. And so I don't think this is the "cause" here.
My point is that I am not aware of any official 4-bit quantized version (delta or weights) by lmsys, so it might be too early to conclude that vicuna fine-tuned llamas lose a lot of performance at 4 bit while others are fine.
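To make the 8-bit vs 4-bit question a bit more concrete, here is a toy comparison of rounding error. Note this is naive round-to-nearest quantization with a single shared scale, not GPTQ and not the block-wise formats llama.cpp uses, and the weight values are invented. It only shows that fewer bits means coarser rounding; it says nothing about whether a given model's output quality actually suffers.

```python
def quantize_roundtrip(values, bits):
    """Quantize to signed `bits`-bit integers with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]


def max_error(values, bits):
    """Largest absolute difference between original and restored values."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(v - r) for v, r in zip(values, restored))


# Invented example weights, purely for illustration.
weights = [0.12, -0.70, 0.33, 0.05, -0.41]
print("8-bit max error:", max_error(weights, 8))
print("4-bit max error:", max_error(weights, 4))
```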