| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by winternewt 113 days ago
	And if you don't want to buy a Mac? A 80 GB NVidia GPU costs $10,000K (equivalent to 30 years of ChatGPT Plus subscription) and will probably be obsolete in 5-7 years anyway. What are my options if I want a decent coding agent at a reasonable price?

8 comments

zepearl 113 days ago

I downloaded Ollama ( https://github.com/ollama/ollama/releases ) and experimented with a few Qwen models ( https://huggingface.co/Qwen/collections ).

My performance when using an RTX 5070 12GiB VRAM, Ryzen 7 9700X 8 cores CPU, 32GiB DDR5 6000MT (2 sticks):

  - "qwen2.5:7b": ~128 tokens/second (this model fits 100% in the VRAM).
  - "qwen2.5:32b": ~4.6 tokens/second.
  - "qwen3:30b-a3b": ~42 tokens/second (this is a MoE model with multiple specialized "brains") (this uses all 12GiB VRAM + 9GiB system RAM, but the GPU usage during tests is only ~25%).
  - qwen3.5:35b-a3b: ~17 tokens/second, but it's highly unstable and crashes -> currently not usable for me.

So currently my sweet spot is "qwen3:30b-a3b" - even if the model doesn't completely fit on the GPU it's still fast enough. "qwen3.5" was disappointing so far, but maybe things will change in the future (maybe Ollama needs some special optimizations for the 3.5-series?).

I would therefore deduce that the most important thing is the amount of VRAM and that performance would be similar even when using an older GPU (e.g. an RTX 3060 with as well 12GiB RAM)?

Performance without a GPU, tested by using a Ryzen 9 5950X 16 cores CPU, 128GiB DDR4 3200 MT:

  - "qwen2.5:7b": ~9 tokens/second
  - "qwen3:32b": ~2 tokens/second
  - "qwen3:30b-a3b": ~16 tokens/second

timschmidt 113 days ago

I'm able to run the Unsloth quants on an ancient dual socket Xeon 1U server I keep around for homelab stuff. It has 8 DDR3 channels, which gives me about as much memory bandwidth as two channels of DDR5 :-/ But 16 sockets and cheaper prices. So it has 256gb in it right now. I have to run the minimum size Unsloth quant for the largest open weight models. They definitely feel a bit dazed. This machine can support up to 1.5TB of DDR3, which would allow me to run many of the largest models unquantized, but at 1/4 of the already abysmal speeds I see of ~ 1 Token / s which is only really usable with multiple agents running a kanban style async development process. Nothing interactive. That said, I picked up the hardware at the local surplus for $25 and it's vintage ~2010. Pretty impressive what this enterprise gear can do.

Power consumption? Don't ask. A subscription is cheaper.

paganel 113 days ago

> Power consumption

That’a the thing, at the end of it all power consumption will matter more for the end-user who doesn’t have money to burn away, because I suspect that power-consumption will, in the majority of cases, exceed the price of the HW itself in a matter of just a few months of intense use, let’s say a year.

timschmidt 113 days ago

Assuming models of a fixed size continue to improve in capability, continued advancement in semiconductors and optimization will reduce power consumption and/or improve performance over time. And used equipment will always approach the scrap price eventually. For me today, on scrap equipment, I get about 4 tokens / watt-hour, which is nominally ~$0.17 US but could run $0.40 after all the taxes and fees and surcharges. $0.10 / token. Ouch.

If I were to try to purpose build a rig for it, I would get an engineering sample Epyc/motherboard/ram combo from Aliexpress with 12 channels of DDR5 and as few cores as allowed me to still use all the memory bandwidth, and I'd run it at the lowest possible power and voltage settings with aggressive ram timings. A system like that can draw 1/3 of what my scrap rig draws, at full load. And has similar memory bandwidth to a high end Mac or GPU allowing it to crank out 5 - 10 Tokens / s on the largest models, which works out to 1/3 of a penny to 2/3 of a penny per token. But either way, Epyc or Mac is going to set you back $10k or more. Hopefully in a few years when they are scrap though...

siquick 113 days ago

Rent a H100 on Modal which scales down to zero when not in use - you can set the time out period.

Cold boot times are around 5m but if your usage periods are predictable it can work out ok. Works out at $2 an hour.

Still far more expensive than a ChatGPT sub.

flyingjoe 113 days ago

Do you have some reference on what setup you're talking about? I'd like to integrate it into my IDE (cursor/vscode) - are there docs on such a setup?

siquick 113 days ago

Start here

https://modal.com/docs/examples/vllm_inference

or give this a go

https://modal.com/docs/examples/opencode_server

You get $30 free credits each month on Modal which is enough to play around (i have no affiliation, just think they run a great service)

segmondy 113 days ago

GPUs are not going obsolete anytime soon. the nvidia p40/p100 launched in 2016, 10 years ago and is popular in the local space. My first set of GPUs were a bunch of P40s from 3 years ago for $150 a piece. They at one point went up all the way to $450, but price is now down to $200 range. I think I have gotten my value from those and I suspect I'll still have them crunching out tokens for at least 3 more years. They still beat 90% of cpu/memory inference combo.

krenerd 113 days ago

Indeed, the point is that it's going for 150$

segmondy 113 days ago

My point being that no one should be buying expensive GPUs when you can pick up a few used ones to get started. But for the sake of discussion let's say you do get a blackwell pro 6000 that's now going for $10,000. I can assure you it will not be $150 10 years from now, with the falling price of dollar, demand for AI inference and hardware shortage, it might cost exactly the same 10 years from now...

winternewt 108 days ago

Unless the bubble bursts and tons of failing AI companies dump used graphics cards on the market.

atwrk 113 days ago

A Strix Halo with 128GB unified memory is less than $2k and the more suitable alternative to a mac. I'm pretty happy with my device (Bosgame M5).

segmondy 113 days ago

the macs outperform it and I figure it's a better general purpose computer than strix halo. if budget is a problem, then a strix halo is a decent alternative.

atwrk 113 days ago

Well a mac isn't really an alternative to a mac, or is it? ;)

Personally I'm not interested in having a mac as I work with linux. And yes, they outperform them, but only if you ignore the price. When comparing what you get for ~$2k, a Strix Halo is miles ahead.

pimeys 113 days ago

Mac doesn't run Linux so in my books is a worse general purpose computer than a Strix Halo box.

Keyframe 113 days ago

A Strix Halo with 128GB unified memory is less than $2k

Where did you get that price? Wherever I looked it's around 3k euros which is around $3.5k

atwrk 113 days ago

Directly from Bosgame.com, for ~1.7k€ in December. I see it's at $2.2k / 1.9k€ now.

Keyframe 113 days ago

why haven't I checked their site first is beyond me :) Thank you for this! You say you're satisfied, right?

atwrk 112 days ago

Yeah I'm pretty happy with the M5 (beside the look). It's most probably the same SixUnited board most Strix Halo devices use (including the ones from HP and Lenovo).

rookonaut 113 days ago

Can you elaborate more on your use cases, models, setup,...?

meta-level 113 days ago

I took my setup from here: https://github.com/kyuz0/amd-strix-halo-toolboxes

Still lot to learn, but after a while you have something like Qwen3-Coder-Next-Q8_0 running and - at least for me - it works quite well, both as ChatGPT like chat-interface using llama.cpp and as coding agent

atwrk 113 days ago

I'm not really using them for coding (only played a little bit with minimax2.1), which is probably the most common use case here.

I mainly use them for deep work with texts and deep research. My main criterion is privacy, both for legal reasons (I'm in the EU and can't and don't want to expose customer's data to non-gdpr-compliant services) and wouldn't use US services personally either, e.g. I would never explore health related topics chatgpt or gemini for obvious reasons.

Technically I've set it up in my office with llama.cpp and have exposed that (both chat interface and openai compatible api) with a simple wireguard tunnel behind nginx and http auth. Now I can use it everywhere. It's a small, quiet and pretty fast machine (compiling llama.cpp is around 20 seconds?), I quite like it.

Keyframe 113 days ago

What are my options if I want a decent coding agent at a reasonable price?

I'd even come from another angle.. What are my options if I want a decent coding agent, on the level of what Claude does at any given price? Let's say few tens of thousands of dollars? I've had a limited look at what's available to be run locally and nothing is on par.

renewiltord 113 days ago

Does not exist AFAIK. Even other labs struggle with Claude level performance in real world task. My experience is that no open model is close. You can get RTX 6000 Pro Blackwell (Max-Q is better for power is half). I have heard good things about Qwen3 coder next but I could not get tool calling to be high performance but it’s likely to be pebkac.

If you want to spend big bucks get h200 141 GB but honestly RTX 6000 pro is good enough till you know what you want. Workstation edition is good. It takes care of cooling etc.

Tbh even better is to just get model through cloud. If you want you can rent GPU. Then see if it’s what you want.

Keyframe 113 days ago

The gist of it is no matter the money you spend on hardware, you will not get the same quality you get from claude. Main question is then what can you run that's good enough? I haven't tested all there is available, but everything I did see does not come even close.

khalic 113 days ago

You can rent GPUs, this comes with a security, maintenance and performance overhead, but also has a few advantages.

But right now, a Mac is the easiest way because of their memory architecture.

am17an 113 days ago

Honestly you can run this on a 16GB VRAM GPU with llama.cpp. Just try it!