Hacker News new | ask | show | jobs
by zepearl 4 hours ago
I agree. To run an acceptable model (e.g. Qwen/Qwen3.6-27B or google/gemma-4-31B) with a good quantization (minimum Q5) with a good context size (min 64k) you could buy 2 or even 3 GTX 5060 16GiB VRAM for ~550$ each. Fyi the much faster MoE models were useless for my usecases - e.g not able to correctly identify me/I/you, endless thinking loops, etc.

I'm currently running those models using an RTX 5070 12GiB + RTX 5060 16GiB + RTX 3060 12GiB with a 96k context size with MTP/speculative decoding and I'm quite happy (the 5070 is about 4x faster than the 3060, the 5060 is inbetween them so about 2x faster than a 3060).

2 comments

How are you running these together, splitting the model somehow or did you mean different models on any one card at a time?
how many tokens per second do you get?
I bought two RTX3080s with 20GB during my holiday in china (set me back 700euros) I'm getting 800-1000 input tps and 60-100tps output with Qwen 3.6 27b Q8 (MTP, P2P, 200k context) this feels like opus4.5 level while coding (pi harness). Also easy to just host your own openai compatible api from home this way and still use your MacBook as dev station.