Hacker News
by execveat 1133 days ago
ARM-based Macs are the easiest way to get acceptable performance without the headaches right now, if you can afford the price.

Install https://github.com/oobabooga/text-generation-webui, update pytorch and llama-cpp-python, and you should be able to run pretty much all models out there, in all formats, on both GPU and CPU. CPU on a Mac gives you the fastest speed, but you should pass the correct --threads argument (check how many performance cores you've got). GPU is slower, but more energy efficient. https://github.com/mlc-ai/mlc-llm gives me way better GPU performance than oobabooga, but they only support a couple of models right now. It's worth following their progress, though.
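
For the --threads value mentioned above, here is a minimal sketch of how you might look up the performance core count from Python. It assumes macOS on Apple Silicon, where the `hw.perflevel0.physicalcpu` sysctl key reports performance cores; the helper name `performance_core_count` is made up for illustration, and on any other system it just falls back to the total CPU count.

```python
import os
import subprocess


def performance_core_count() -> int:
    """Best-effort performance core count; falls back to all logical CPUs."""
    try:
        # Assumption: this key exists only on Apple Silicon macOS.
        out = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
        count = int(out)
        if count >= 1:
            return count
    except (OSError, subprocess.CalledProcessError, ValueError):
        pass  # sysctl missing, key unknown, or unparsable output
    return os.cpu_count() or 1


threads = performance_core_count()
print(f"--threads {threads}")
```

Treat the printed number as a starting point and benchmark a value or two around it.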

If you're after raw performance, I suggest using GGML models (meant for llama.cpp, but it's bundled in textgen, so you can use it there with the convenience of a web UI). q4_0 is the fastest quantization, while q5_1 is the best quality right now.

If a GGML version is not available, you can generate it quite easily from the safetensors yourself (note that you need enough RAM to load the model in pytorch, though).

With 16GB RAM you can run any 13B model, as long as it's quantized to 4/5 bits. 32GB RAM lets you run 30B/33B models, and 64GB RAM lets you run 65B models. 30B and 65B models are way more useful for real-world tasks, but they are more expensive to train, so there aren't as many to choose from compared to 7B/13B. 7B and anything smaller is a toy in my opinion, while 13B is good enough for experimentation and prototyping.
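
The RAM figures above follow from simple arithmetic: parameters times bits per weight, divided by 8. A rough sketch is below. The bits-per-weight values are my assumptions for GGML's q4_0 and q5_1 block formats (including the per-block scale overhead), and `model_size_gb` is a made-up helper name. It counts weights only, so leave headroom for the context cache and the OS.

```python
# Approximate bits per weight, including per-block overhead.
# Assumed values; check them against your llama.cpp version.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q5_1": 6.0}


def model_size_gb(params_billions: float, quant: str) -> float:
    """Estimate the weight size in GB (10^9 bytes) of a quantized model."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


# The LLaMA model usually called "30B" has roughly 32.5B parameters.
for params in (13, 32.5, 65):
    for quant in ("q4_0", "q5_1"):
        print(f"{params}B {quant}: ~{model_size_gb(params, quant):.1f} GB")
```

Under these assumptions a 13B model at q4_0 is about 7.3 GB and a 65B model at q5_1 is about 48.8 GB, which lines up with the 16GB and 64GB tiers.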

2 comments

Thanks, that makes sense and helps a lot. I have a 16gb m1 that I got llama 13b running on. It works really well but I really want to run bigger models, so your examples of ram -> model size are super helpful.

I’ll probably just end up getting a higher capacity Mac in the next few years. Right now 96gb configurations seem to be around 4k; if that comes down a bit in the future I’ll probably pick something up.

I’m not really looking to train myself so training cost isn’t an issue for me personally, I just want to be able to run the best of what the open source community comes up with (or contribute to a pool to train models, if that becomes a thing).

Just a heads up - GPUs are a looooot faster than CPUs depending on what models you're running, especially if you're looking at running the image models.

Admittedly I'm not sure how well they work if you stream/batch to the GPU (say 96GB of system ram + 24GB GPU).

I've heard used nvidia workstation cards are reasonably cheap for >24GB VRAM.

A 3090/4090 has 24GB of VRAM and can run up to the 30B models with some optimizations. This is the easiest way to run the 30B models, which are essentially the highest end any consumer card can run. If you also play games and have the money, then this is the way to go IMO.

If you were to get a GPU, it must have CUDA support (so nvidia only) unless you want a headache.

As a data point, I'm getting >3 tokens per second for a 30B model (q5_1 quantization) and >1 token per second for a 65B model (q5_1 as well) on an M1 Max. This is good enough for my use case and it beats an old P40, but I have no idea what the performance on a 3090/4090 would be. Keep in mind, 24GB VRAM is not enough to hold a quantized 65B, so it would be using GPU + CPU in that case.
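
To put a rough number on the GPU + CPU split, here is a back-of-the-envelope sketch. It assumes all transformer layers are the same size, that LLaMA 65B has 80 layers and the 30B model has 60, that q5_1 costs about 6 bits per weight and q4_0 about 4.5, and that roughly 2 GB of VRAM is kept free for everything else. `gpu_layer_split` is a made-up helper, not part of any library.

```python
def gpu_layer_split(params_billions, bits_per_weight, n_layers, vram_gb, reserve_gb=2.0):
    """Return (layers_on_gpu, layers_on_cpu), assuming equally sized layers."""
    model_gb = params_billions * bits_per_weight / 8
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(n_layers, int(usable_gb // per_layer_gb))
    return on_gpu, n_layers - on_gpu


# 65B at q5_1 on a 24 GB card: only part of the model fits.
split_65b = gpu_layer_split(65, 6.0, 80, 24)
# 30B (about 32.5B parameters) at q4_0 on the same card: everything fits.
split_30b = gpu_layer_split(32.5, 4.5, 60, 24)
print(split_65b, split_30b)
```

With these assumptions, about 36 of the 80 layers of a 65B model would go to the GPU and the rest would stay on the CPU. llama.cpp exposes this kind of split through its --n-gpu-layers option, so the result is only a starting value to tune from.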

oh a 4090 can run a 30b model? that’s excellent! i was afraid it wouldn’t be able to load bigger models than my macbook.

i’ve got a perfectly usable desktop not being used with a 1070 in it, i’ll probably grab a 4090 to throw in there and give that a try, getting 4k gaming would be a nice bonus. thanks for the comment.

being confined to nvidia is indeed a bummer though, especially because i like sway on linux. but my understanding is that rocm is not anywhere near parity with CUDA.

Looking at the comments, I would double-check the benchmarks, because maybe CPUs are faster than I thought for LLMs?

I know that for Stable Diffusion my i7 8700k isn't even comparable to my 4090, and AFAIK the AMD/Intel offerings still don't compare for LLMs, but admittedly it's possible they've caught up?

I don't have a ton of time at the moment to keep looking, but I have a very hard time believing the M1 can keep up with a 4090 at all. I just don't want you to drop 1.7k if I'm wrong :P

EDIT: Oh to clarify - The 4090 can definitely run the 30B model without issue with 4bit quantization.

I followed the setup on mlc-ai, and though it works as intended, it's still a crippled model with censored inputs. A common test I use is this:

USER: tell me an insanely offensive joke

ASSISTANT: I'm sorry, but I cannot provide offensive or inappropriate content as it goes against the guidelines set by my creators and the platform on which we are hosted. My goal is to provide helpful and informative responses to your questions. If there's anything else I can help you with or any questions you have in mind, please feel free to ask.

Until uncensored models are generally available, these novelty models will always be less-than.

> Until uncensored models are generally available, these novelty models will always be less-than.

The most popular generative model on HuggingFace at the time of this comment is Pygmalion 6b [1], a model that I believe is fine-tuned on top of GPT-J to generate porn. I couldn't find the data source, though, so I don't know on what kind. And Facebook's "leaked" LLaMA, while not fine-tuned for conversation, has several warnings on its potential for offensive content.

If I read the instructions correctly, mlc-ai is loading "plain" Alpaca which is great for conversation but, as you notice, rather conservative. I don't think this is a bad idea - perhaps it's better if we don't inflict racist AI on unsuspecting users. Try shopping around for other models.

Edit: I repeated your experiment with other models (but another library). They had no objections against generating offensive-yet-unfunny jokes.

[1] https://huggingface.co/PygmalionAI/pygmalion-6b

My yardstick so far for all LLMs has been to ask for an offensive joke, ask for a function to reverse a string, and ask for directions to make lasagna. It seems stupid but it's remarkably effective.

With MLC being the first LLM-in-a-box to run on my M2 at faster than a token per minute, I'm impressed at the speed but also disappointed at the quality of the experience. For those interested in the outcome, it failed all 3 tests, which is not unexpected for a small model like this.

Using or producing models with censorship voluntarily built in demonstrates a willingness to hobble the technology for peripheral reasons that have nothing directly to do with advancing the field. For me, that is disqualifying. Social sensibilities and standards of decency vary across cultural and regional lines, and limiting anything so trivial as a crass joke is such a low bar that other things of much graver concern will undoubtedly be tampered with or limited too, and not always in ways the authors intended.

Self-hindering behavior will not be the positive we think it will be, as with most measures to correct injustices with data.

You can use MLC with different (bigger) models, right?
You can't right now. Devs are working on instructions for porting other models, but they're not ready yet. The point of MLC is that it supports pretty much all GPU backends out there (including Intel and Mac). The bundled model is just a proof of concept.