
	by anon373839 596 days ago
	We’re already past that point! MacBooks can easily run models exceeding GPT-3.5, such as Llama 3.1 8B, Qwen 2.5 7B, or Gemma 2 9B. These models run at very comfortable speeds on Apple Silicon. And they are distinctly more capable and less prone to hallucination than GPT-3.5 was. Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to GPT-4, and they will run on MacBook Pros with at least 64GB of RAM. However, I have an M3 Max and I can’t say that models of this size run at comfortable speeds. They’re a bit sluggish.

2 comments

noman-land 594 days ago

The coolness of local LLMs is THE only reason I am sadly eyeing upgrading from M1 64GB to M4/5 128+GB.

Terretta 594 days ago

Compare performance on various Macs here as it gets updated:

https://github.com/ggerganov/llama.cpp/discussions/4167

On my machine, Llama 3.3 70B runs at ~7 text generation tokens per second on a MacBook Pro Max 128GB, while generating GPT-4-feeling text with more in-depth responses and fewer bullets. Llama 3.3 70B also doesn't fight the system prompt; it leans in.

Consider e.g. LM Studio (0.3.5 or newer) for a Metal (MLX) centered UI, and include "MLX" in your search term when downloading models.

Also, do not scrimp on the storage. At 60GB - 100GB per model, it takes a day of experimentation to use 2.5TB of storage in your model cache. And remember to exclude that path from your Time Machine backups.
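As a rough check on that storage figure, here is a back-of-the-envelope sketch. The 60GB to 100GB per model and the 2.5TB cache come from the comment above; the function name is made up for illustration, and decimal units (1TB = 1000GB) are an assumption.

```python
def models_that_fit(cache_gb: float, model_gb: float) -> int:
    """Number of whole models of a given size that fit in a cache."""
    return int(cache_gb // model_gb)


CACHE_GB = 2500  # 2.5TB, assuming decimal units

# Upper and lower ends of the per-model size range quoted above.
print(models_that_fit(CACHE_GB, 100))
print(models_that_fit(CACHE_GB, 60))
```

So a 2.5TB cache holds somewhere between a couple dozen and about forty downloads of that size, which is not many if you are comparing quantizations of several models.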

noman-land 593 days ago

Thank you for all the tips! I'd probably go 128GB 8TB because of masochism. Curious: what puts so many of the M4s in the red currently?

vessenes 593 days ago

It's all memory bandwidth related -- what's slow is reading the model weights out of memory for every generated token, basically. The last die from Apple with all the channels was the M2 Ultra, and I bet that's what tops those leaderboards. M4 has not had a Max or an Ultra release yet; when it does (and it seems likely it will), those will be the ones to get.
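The bandwidth point can be made concrete with a rough upper bound: if generating each token has to read every weight from memory once, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below is only that bound. It ignores compute, the KV cache, and runtime overhead, and the bandwidth values passed in are illustrative assumptions, not measurements from this thread.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return params_billions * bits_per_weight / 8


def max_tokens_per_second(bandwidth_gb_s: float,
                          params_billions: float,
                          bits_per_weight: float) -> float:
    """Upper bound on generation speed if every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb(params_billions, bits_per_weight)


# A 70B model at 4 bits per weight, on two assumed bandwidth figures (GB/s).
for bandwidth in (400, 800):
    bound = max_tokens_per_second(bandwidth, 70, 4)
    print(f"{bandwidth} GB/s -> at most {bound:.1f} tokens/s")
```

Real-world numbers land below this bound, which fits the ~7 tokens per second reported upthread for a 70B model.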

ant6n 593 days ago

What if you have a MacBook Air with 16GB? (The benchmarks don't seem to show memory.)

simonw 593 days ago

You could definitely run an 8B model on that, and some of those are getting very capable now.

The problem is that often you can't run anything else. I've had trouble running larger models in 64GB when I've had a bunch of Firefox and VS Code tabs open at the same time.

xdavidliu 592 days ago

I thought VS Code was supposed to be lightweight, though I suppose with extensions it can add up.

evilduck 593 days ago

8B models with larger contexts, or even 9-14B parameter models quantized.

Qwen2.5 Coder 14B at a 4-bit quantization could run, but you will need to be diligent about what else you have in memory at the same time.
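To see why a 14B model at 4 bits is borderline on a 16GB machine, a minimal sketch of the arithmetic: weights take roughly parameters times bits per weight divided by 8. The function names and the 6GB reserved for the OS, other apps, and the context cache are assumptions for illustration, not figures from this thread.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         ram_gb: float, reserved_gb: float) -> bool:
    """True if the weights fit in RAM after setting aside reserved_gb."""
    return weights_gb(params_billions, bits_per_weight) <= ram_gb - reserved_gb


RAM_GB = 16
RESERVED_GB = 6  # assumed headroom for OS, apps, and context

for params, bits in ((8, 4), (14, 4), (14, 8)):
    size = weights_gb(params, bits)
    verdict = "fits" if fits(params, bits, RAM_GB, RESERVED_GB) else "does not fit"
    print(f"{params}B at {bits}-bit: ~{size:.0f}GB of weights, {verdict}")
```

Raise the reserved figure to match what your browser and editor actually use, and the 14B case tips over quickly.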

chris_st 593 days ago

I have an M2 Air with 24GB, and have successfully run some 12B models such as mistral-nemo. Had other stuff going as well, but it's best to give it as much of the machine as possible.

gcanyon 593 days ago

I recently upgraded to exactly this machine for exactly this reason, but I haven't taken the leap and installed anything yet. What's your favorite model to run on it?

stkdump 593 days ago

I bought an old used desktop computer, a used 3090, and upgraded the power supply, all for around 900€. Haven't assembled it all yet. But it will be able to comfortably run 30B parameter models at 30-40 T/s. The M4 Max can do ~10 T/s, which is not great once you really want to rely on it for your productivity.

Yes, it is not "local" as I will have to use the internet when not at home. But it will also not drain the battery very quickly when using it, which I suspect would happen to a MacBook Pro running such models. Also 70B models are out of reach of my setup, but I think they are painfully slow on Mac hardware.

jazzyjackson 593 days ago

I'm returning my 96GB M2 Max. It can run unquantized Llama 3.3 70B, but tokens per second is slow as molasses, and I still couldn't find any use for it; I just kept going back to Perplexity when I actually needed to find an answer to something.

Tepix 591 days ago

Interesting. You're using the FP8 version, I'm guessing? How many tokens/s are you getting, and with which software? MLX?

alecco 593 days ago

I'm waiting for next gen hardware. All the companies are aiming for AI acceleration.

kleiba 593 days ago

Sorry, I'm not up to date, but can you run GPTs locally or only vanilla LLMs?

kgeist 593 days ago

>MacBooks can easily run models exceeding GPT-3.5, such as Llama 3.1 8B, Qwen 2.5 7B, or Gemma 2 9B.

If only those models supported anything other than English

simonw 593 days ago

Llama 3.1 8B advertises itself as multilingual.

All of the Qwen models are basically fluent in both English and Chinese.

kgeist 593 days ago

Llama 8B is multilingual on paper, but the quality is very bad compared to English. It generally understands grammar, and you can understand what it's trying to say, but the choice of words is very off most of the time, often complete gibberish. If you can imagine the output of an undertrained model, this is it. Meanwhile GPT-3.5 had far better output that you could use in production.

barrell 593 days ago

Cohere just announced Command R7B. I haven’t tried it yet, but their larger models are the best multilingual models I’ve used.

numpad0 593 days ago

Is the subtext here uncensored Chinese support?
