by sosodev 5 hours ago
You can run a trillion parameter model with decent quality for far less than $300k. A cluster of 4 AMD Ryzen AI Max+ 395 boards with 128GB unified memory each can be had for around $15k. That would run the 4-bit quant of a trillion param model well enough for personal use. At full load the cluster would only draw around 400-500W too. That's about the same as one high end graphics card.

That's still a lot of money, but most people don't really need a trillion parameter model. If privacy is more valuable than the frontier capabilities then they could almost certainly get by with much less.


Which model? I see a suspiciously similar post on amd.com running a 2-bit Kimi quant on a four-node cluster over 5Gbps Ethernet.

Assuming the math works here (although I think there are some caveats depending on the model architecture), 1T params at 4 bits is about 465 GiB just for the weights, so on 4x128GB there wouldn't be much room left for the KV cache.
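For anyone who wants to check the arithmetic, here is a rough sketch. It assumes exactly 4 bits per weight with no quantization overhead (real quant formats store scales and usually come out larger), and it reads "128GB" as 128 GiB with all of it usable, which is optimistic since the OS needs some of it.

```python
# Back-of-envelope weight memory for a quantized model.
# Assumption: exactly 4 bits per weight, no scale/zero-point overhead.
GIB = 2**30

def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB

weights = weight_gib(1e12, 4)   # 1T params at 4 bits each
# Assumption: 4 nodes x 128 GiB, every byte available to the model.
cluster = 4 * 128
headroom = cluster - weights

print(f"weights:  {weights:.1f} GiB")
print(f"cluster:  {cluster} GiB")
print(f"headroom: {headroom:.1f} GiB")
```

Under those assumptions the weights come to roughly 465.7 GiB, leaving about 46 GiB across the whole cluster for KV cache, activations, and everything else.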

It's showing about 8-9 tk/sec, which seems quite slow for something like a web search with result aggregation, although maybe bearable for smaller-context stuff.
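To put 8-9 tk/sec in wall-clock terms, a quick sketch. The 2,000-token answer length is a made-up figure for illustration, and this only counts generation time, not prompt processing of the fetched search results.

```python
# Generation time at a given decode rate.
# Assumption: 2,000 output tokens is a hypothetical answer length.
def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Seconds to generate n_tokens, ignoring prompt processing."""
    return n_tokens / tokens_per_sec

slow = generation_seconds(2000, 8)   # lower end of the quoted range
fast = generation_seconds(2000, 9)   # upper end of the quoted range
print(f"{fast:.0f}-{slow:.0f} seconds, i.e. about {fast / 60:.1f}-{slow / 60:.1f} minutes")
```

So a single long answer would take around four minutes at that rate, before counting the time to read the search results in.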

The thing I've been running into with z.ai hosted GLM-5.2 is the 2024 knowledge cutoff. Anything recent requires web augmentation, which is more token intensive, so low tk/sec hurts even more than it would with a "smarter" model.

It seems (somewhat unsurprisingly) open weight models have older knowledge cutoffs.

I don’t have any particular model in mind, sorry. My numbers are just rough estimates based on my experience with a single node setup. You might need to opt for a 2 or 3 bit model to get the full context window. The KV cache memory consumption, as well as overall performance, will be heavily dependent on the model’s architecture. Performance will also depend a lot on the inference server chosen and its configuration. I suspect a sub-agent running a much smaller model would be the ideal way to get the latest knowledge via web search and summarization.
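To illustrate how much the architecture matters for the KV cache, here is a rough sketch using the textbook formula for plain multi-head or grouped-query attention. All the model dimensions below are hypothetical and not taken from any real model. Architectures that compress the cache (multi-head latent attention, for example) will come out much smaller than this formula suggests.

```python
# KV cache size for standard MHA/GQA attention:
#   2 (keys and values) x layers x kv_heads x head_dim x seq_len x bytes per element
# Assumption: every number passed in below is a hypothetical, illustrative value.
GIB = 2**30

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence, in GiB (16-bit cache by default)."""
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
    return total / GIB

# Hypothetical config: 60 layers, 8 KV heads of dim 128, 128k-token context.
full_context = kv_cache_gib(60, 8, 128, 131072)
short_context = kv_cache_gib(60, 8, 128, 8192)
print(f"128k context: {full_context:.1f} GiB")
print(f"8k context:   {short_context:.3f} GiB")
```

With those made-up dimensions a full 128k context costs 30 GiB for a single sequence, while 8k costs under 2 GiB, which is why a lower-bit quant or a shorter context is the usual trade-off.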

I’m not trying to say that this would be a great experience or really compete with just buying a subscription to the top models. Rather I just wanted to point out that $300k is an absurd estimate for a trillion param model meant for personal use.

I imagine a smaller single-node model would give a significantly better experience at significantly lower cost. When I was poking around with infra estimates, it seemed the main issue/cost came once you crossed from single-node to multi-node. You need _a lot_ of bandwidth if the weights are sharded. Like Tbps of bandwidth. The closest reasonable thing I've heard of for local multi-node is exo on macOS using a Thunderbolt interconnect.