Hacker News new | ask | show | jobs
by jpgvm 5 hours ago
If you want a massive MacBook anyway then it's great. They are decent for local LLMs, awesome for local image models and it's a MacBook so AppleCare+ has your back. IMO it's a no brainer if you wanted a MacBook anyway but it's a poor choice if your reason to buy it is to run LLMs.
2 comments

I agree. To run an acceptable model (e.g. Qwen/Qwen3.6-27B or google/gemma-4-31B) with a good quantization (minimum Q5) with a good context size (min 64k) you could buy 2 or even 3 GTX 5060 16GiB VRAM for ~550$ each. Fyi the much faster MoE models were useless for my usecases - e.g not able to correctly identify me/I/you, endless thinking loops, etc.

I'm currently running those models using an RTX 5070 12GiB + RTX 5060 16GiB + RTX 3060 12GiB with a 96k context size with MTP/speculative decoding and I'm quite happy (the 5070 is about 4x faster than the 3060, the 5060 is inbetween them so about 2x faster than a 3060).

How are you running these together, splitting the model somehow or did you mean different models on any one card at a time?
how many tokens per second do you get?
I bought two RTX3080s with 20GB during my holiday in china (set me back 700euros) I'm getting 800-1000 input tps and 60-100tps output with Qwen 3.6 27b Q8 (MTP, P2P, 200k context) this feels like opus4.5 level while coding (pi harness). Also easy to just host your own openai compatible api from home this way and still use your MacBook as dev station.
are you saying because of speed or it just cant run them?