| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by chadash 116 days ago

> Will LLMs be cheaper than humans once the subsidies for tokens go away? At this point we have little visibility to what the true cost of tokens is now, let alone what it will be in a few years time. It could be so cheap that we don’t care how many tokens we send to LLMs, or it could be high enough that we have to be very careful.

We do have some idea. Kimi K2 is a relatively high performing open source model. People have it running at 24 tokens/second on a pair of Mac Studios, which costs 20k. This setup requires less than a KW of power, so the $0.8-0.15 being spent there is negligible compared to a developer. This might be the cheapest setup to run locally, but it's almost certain that the cost per token is far cheaper with specialized hardware at scale.

In other words, a near-frontier model is running at a cost that a (somewhat wealthy) hobbyist can afford. And it's hard to imagine that the hardware costs don't come down quite a bit. I don't doubt that tokens are heavily subsidized but I think this might be overblown [1].

[1] training models is still extraordinarily expensive and that is certainly being subsidized, but you can amortize that cost over a lot of inference, especially once we reach a plateau for ideas and stop running training runs as frequently.

10 comments

embedding-shape 116 days ago

> a near-frontier model

Is Kimi K2 near-frontier though? At least when run in an agent harness, and for general coding questions, it seems pretty far from it. I know what the benchmarks say, they always say it's great and close to frontier models, but is this other's impression in practice? Maybe my prompting style works best with GPT-type models, but I'm just not seeing that for the type of engineering work I do, which is fairly typical stuff.

crystal_revenge 116 days ago

I’ve been running K2.5 (through the API) as my daily driver for coding through Kimi Code CLI and it’s been pretty much flawless. It’s also notably cheaper and I like the option that if my vibe coded side projects became more than side projects I could run everything in house.

I’ve been pretty active in the open model space and 2 years ago you would have had to pay 20k to run models that were nowhere near as powerful. It wouldn’t surprise me if in two more years we continue to see more powerful open models on even cheaper hardware.

vuldin 116 days ago

I agree with this statement. Kimi K2.5 is at least as good as the best closed source models today for my purposes. I've switched from Claude Code w/ Opus 4.5 to OpenCode w/ Kimi K2.5 provided by Fireworks AI. I never run into time-based limits, whereas before I was running into daily/hourly/weekly/monthly limits all the time. And I'm paying a fraction of what Anthropic was charging (from well over $100 per month to less than $50 per month).

hjordache 116 days ago

Beyond agree. Was spending crazy amounts on Claude and it was sporadic at best. Some moments, Opus was a rockstar, others, it couldn’t solve the simplest of problems. Switched to Kimi K2.5 and honestly didn’t think it would do anything other than destroy my code. Crazy enough, it solved the problem I had in less than 60 seconds and I was hooked. Not to say it doesn’t have issues, it does, started repeating itself over and over, forgets things after so much context, etc, though it writes damn good code when it does work properly and for an absolute fraction of the price Anthropic charges.

cadamsdotcom 116 days ago

Saw you wrote that you moved away from Opus 4.5. If you haven’t tried Opus 4.6, there’s only one number different in the name, but the common experience is it’s significantly better.

Have you tried 4.6 as a comparison to Kimi K2.5?

giancarlostoro 116 days ago

> OpenCode w/ Kimi K2.5 provided by Fireworks AI

Are you just using the API mode?

hjordache 116 days ago

API mode and Kimi k2.5 is currently free on OpenCode. Enjoy!

giancarlostoro 112 days ago

What? Like self hosted or what? Because I'm eerie of using any API services if it's not US based, I don't need all my IP going overseas.

varispeed 116 days ago

Depends what you see as flawless. From my perspective even GPT 5.2 produces mostly garbage grade code (yes it often works, but it is not suitable for anywhere near production) and takes several iterations to get it to remotely workable state.

crystal_revenge 116 days ago

> not suitable for anywhere near production

This is what I've been increasingly understanding is the wrong way to understand how LLMs are changing things.

I fully agree that LLMs are not suitable for creating production code. But the bigger question you need to ask is 'why do we need production code?' (and to be clear, there are and always will be cases where this is true, just increasingly less of them)

The entire paradigm of modern software engineering is fairly new. I mean it wasn't until the invention of the programmable microprocessor that we even had the concept of software and that was less than 100 years ago. Even if you go back to the 80s, a lot of software doesn't need to be distributed or serve a endless variety of users. I've been reading a lot of old Common Lisp books recently and it's fascinating how often you're really programming lisp for you and your experiments. But since the advent of the web and scaling software to many users with diverse needs we've increasingly needed to maintain systems that have all the assumed properties of "production" software.

Scalable, robust, adaptable software is only a requirement because it was previously infeasible for individuals to build non-trivial systems for solving any more than a one or two personal problems. Even software engineers couldn't write their own text editor and still have enough time to also write software.

All of the standard requirements of good software exist for reasons that are increasingly becoming less relevant. You shouldn't rely on agents/LLMs to write production code, but you also should increasingly question "do I need production code?"

munksbeer 115 days ago

This is a very interesting aspect. I've been thinking along these lines.

Consider design patterns, or clean code, or patterns for software development, or any other system that people use to write their code, and reviewers use to review the code. What are they actually for? This question is going to seem bizarre to most programmers at first, because it is so ingrained in us, that we almost forget why we have those patterns.

The entire point is to ensure the code is maintainable. In order to maintain it, we must easily understand it, and and be sure we're not breaking something when we do. That is what design patterns solve, making easier to understand and more maintainable.

So, I can imagine a future where the definition of "production code" changes.

varispeed 116 days ago

> Scalable, robust, adaptable software is only a requirement because it was previously infeasible for individuals to build non-trivial systems for solving any more than a one or two personal problems. Even software engineers couldn't write their own text editor and still have enough time to also write software.

That's a wild assumption. I personally know engineers who _alone_ wrote things like compilers, emulators, editors, complex games and management systems for factories, robots. That was before internet was widely available and they had to use physical books to learn.

embedding-shape 115 days ago

Yeah, that jumped out from me too. Plenty of hackers could write their own text editor + have time to be professional developers to do other things. How do people think most of FOSS actually happened 15-20 years ago? Most of us were hacking on stuff in our free-time, but still having day jobs.

bspinner 116 days ago

In terms of security: yes, everyone needs production code.

e12e 116 days ago

In my mind, "yolo ai" application (throwaway code on one hand, unrestrained assistants on the other) - is a little like better spreadsheets and smart documents were in the 90s; just run macros! Everywhere! No need for developers - just Word an macros!

Then came macro viri - and practically - everyone cut back hard on distributing code via Word and Excel (in favour of web apps and we got the dot.com bubble).

embedding-shape 116 days ago

> it’s been pretty much flawless

So above and beyond frontier models? Because they certainly aren't "flawless" yet, or we have very different understanding of that word.

crystal_revenge 116 days ago

I have increasingly changed my view on LLMs and what they're good for. I still strongly believe LLMs cannot replace software engineers (they can assist yes, but software engineering requires too much 'other' stuff that LLMs really can't do), but LLMs can replace the need for software.

During the day I am working on building systems that move lots of data around where context and understanding of the business problem is everything. I largely use LLMs for assistance. This is because I need the system to be robust, scalable, maintainable by other people and adaptable to large range of future needs. LLMs will never be flawless in a meaningful sense in this space (at least in my opinion).

When I'm using Kimi I'm using it for purely vibe coded projects where I don't look at the code (and if I do I consider this a sign I'm not thinking about the problem correctly). Are these programs robust, scalable, generalizable, adaptable to future use case? No, not at all. But they don't need to be, they need to serve a single user for exactly the purpose I have. There are tasks that used to take me hours that now run in the background while I'm at work.

In this latter sense I say "flawless" because 90% of my requests solve the problem on the first pass, and the 10% of the time where there is some error, it is resolved in a single request, and I don't have to ever look at the code. For me that "don't have to look at the code" is a big part of my definition of "flawless".

mhitza 116 days ago

Your definition of flawless is fine for you and requires a big asterix. But without being called out on it look how your message would have read for someone that's not in the known of LLM limitations, and contributed further to the dissilusionment of the field and the gaslighting that's already going on by big comapnies.

fullstackchris 116 days ago

regardless its been 3 years since the release of chatgpt. literally 3. imagine in just 5 more years how much low hanging (or even big breakthroughs) will get into the pricing, things like quantization, etc. no doubt in my mind the question of "price per token" will head towards 0

lambda 116 days ago

You don't even need to go this expensive. An AMD Ryzen Strix Halo (AI Max+ 395) machine with 128 GiB of unified RAM will set you back about $2500 these days. I can get about 20 tokens/s on Qwen3 Coder Next at an 8 bit quant, or 17 tokens per second on Minimax M2.5 at a 3 bit quant.

Now, these models are a bit weaker, but they're in the realm of Claude Sonnet to Claude Opus 4. 6-12 months behind SOTA on something that's well within a personal hobby budget.

sosodev 116 days ago

I was testing the 4-bit Qwen3 Coder Next on my 395+ board last night. IIRC it was maintaining around 30 tokens a second even with a large context window.

I haven't tried Minimax M2.5 yet. How do its capabilities compare to Qwen3 Coder Next in your testing?

I'm working on getting a good agentic coding workflow going with OpenCode and I had some issues with the Qwen model getting stuck in a tool calling loop.

lambda 116 days ago

I've literally just gotten Minimax M2.5 set up, the only test I've done is the "car wash" test that has been popular recently: https://mastodon.world/@knowmadd/116072773118828295

Minimax passed this test, which even some SOTA models don't pass. But I haven't tried any agentic coding yet.

I wasn't able to allocate the full context length for Minimax with my current setup, I'm going to try quantizing the KV cache to see if I can fit the full context length into the RAM I've allocated to the GPU. Even at a 3 bit quant MiniMax is pretty heavy. Need to find a big enough context window, otherwise it'll be less useful for agentic coding. With Qwen3 Coder Next, I can use the full context window.

Yeah, I've also seen the occasional tool call looping in Qwen3 Coder Next, that seems to be an easy failure mode for that model to hit.

lambda 116 days ago

OK, with MiniMax M2.5 UD-Q3_K_XL (101 GiB), I can't really seem to fit the full context in even at smaller quants. Going up much above 64k tokens, I start to get OOM errors when running Firefox and Zed alongside the model, or just failure to allocate the buffers, even going down to 4 bit KV cache quants (oddly, 8 bit worked better than 4 or 5 bit, but I still ran into OOM errors).

I might be able to squeeze a bit more out if I were running fully headless with my development on another machine, but I'm running everything on a single laptop.

So looks like for my setup, 64k context with an 8 bit quant is about as good as I can do, and I need to drop down to a smaller model like Qwen3 Coder Next or GPT-OSS 120B if I want to be able to use longer contexts.

lambda 116 days ago

After some more testing, yikes, MiniMax M2.5 can get painfully slow on this setup.

Haven't tried different things like switching between Vulkan and ROCm yet.

But anyhow, that 17 tokens per second was on almost empty context. By the time I got to 30k tokens context or so, it was down in the 5-10 tokens per second, and even occasionally all the way down to 2 tokens per second.

Oh, and it looks like I'm filling up the KV cache sometimes, which is causing it to have to drop the cache and start over fresh. Yikes, that is why it's getting so slow.

Qwen3 Coder Next is much faster. MiniMax's thinking/planning seems stronger, but Qwen3 Coder Next is pretty good at just cranking through a bunch of tool calls and poking around through code and docs and just doing stuff. Also MiniMax seems to have gotten confused by a few things browsing around the project that I'm in that Qwen3 Coder Next picked up on, so it's not like it's universally stronger.

sosodev 115 days ago

Thanks for the additional info. I suspected that MiniMax M2.5 might be a bit too much for this board. 230B-A10B is just a lot to ask of the 395+ even with aggressive quantization. Particularly when you consider that the model is going to spend a lot of tokens thinking and that will eat into the comparatively smaller context window.

I switched from the Unsloth 4-bit quant of Qwen3 Coder Next to the official 4-bit quant from Qwen. Using their recommended settings I had it running with OpenCode last night and it seemed to be doing quite well. No infinite loops. Given its speed, large context window, and willingness to experiment like you mentioned I think it might actually be the best option for agentic coding on the 395+ for now.

I am curious about https://huggingface.co/stepfun-ai/Step-3.5-Flash given that it does parallel token generation. It might be fast enough despite being similar in size to M2.5. However, it seems there are still some issues that llama.cpp and stepfun need to work out before it's ready for everyday use.

nyrikki 116 days ago

It is crazy to me that it is that slow, 4 bit quants don't lose much with Qwen3 coder next and unsloth/Qwen3-Coder-Next-UD-Q4_K_XL gets 32 tps with a 3090 (24gb) as a VM with 256k context size with llama.cpp

Same with unsloth/gpt-oss-120b-GGUF:F16 gets 25 tps and gpt-oss20b gets 195 tps!!!

The advantage is that you can use the APU for booting, and pass through the GPU to a VM, and have nice safer VMs for agents at the same time while using DDR4 IMHO.

lambda 116 days ago

Yeah, this is an AMD laptop integrated GPU, not a discrete NVIDIA GPU on a desktop. Also, I haven't really done much to try tweaking performance, this is just the first setup I've gotten that works.

nyrikki 116 days ago

The memory bandwidth of the Laptop CPU is better for fine tuning, but MoE really works well for inference.

I won’t use a public model for my secret sauce, no reason to help the foundation models on my secret sauce.

Even an old 1080ti works well for FIM for IDEs.

IMHO the above setup works well for boilerplate and even the sota models fail for the domain specific portions.

While I lucked out and foresaw the huge price increases, you can still find some good deals. Old gaming computers work pretty well, especially if you have Claude code locally churn on the boring parts while you work on the hard parts.

lambda 116 days ago

Yeah, I have a lot of problems with the idea of handing our ability to write code over to a few big Silicon Valley companies, and also have privacy concerns, environmental concerns, etc, so I've refused to touch any agentic coding until I could run open weights models locally.

I'm still not sold on the idea, but this allows me to experiment with it fully locally, without paying rent to some companies I find quite questionable, and I can know exactly how much power I'm drawing and the money is already spent, I'm not spendding hundreds a month on a subscription.

And yes, the Strix Halo isn't the only way to run models locally for a relatively affordable price; it's just the one I happened to pick, mostly because I already needed a new laptop, and that 128 GiB of unified RAM is pretty nice even when I'm not using most of it for a model.

cowmix 116 days ago

If you don't mind saying, what distro and/or Docker container are you using to bet Qwen3 Coder Next going?

lambda 116 days ago

I'm running Fedora Silverblue as my host OS, this is the kernel:

  $ uname -a
  Linux fedora 6.18.9-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Feb  6 21:43:09 UTC 2026 x86_64 GNU/Linux

You also need to set a few kernel command line paramters to set it up to allow it to use most of your memory as graphics memory, I have the following in my kernel command line, those are each 110 GiB expressed in number of pages (I figure leaving 18 GiB or so for CPU memory is probably a good idea):

  ttm.pages_limit=28835840 ttm.page_pool_size=28835840

Then I'm running llama.cpp in the official llama.cpp Docker containers. The Vulkan one works out of the box. I had to build the container myself for ROCm, the llama.cpp container has ROCm 7.0 but I need 7.2 to be compatible with my kernel. I haven't actually compared the speed directly between Vulkan and ROCm yet, I'm pretty much at the point where I've just gotten everything working.

In a checkout of the llama.cpp repo:

  podman build -t llama.cpp-rocm7.2 -f .devops/rocm.Dockerfile --build-arg ROCM_VERSION=7.2 --build-arg ROCM_DOCKER_ARCH='gfx1151' .

Then I run the container with something like:

  podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable --rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2  --model unsloth/MiniMax-M2.5-GGUF/UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf --jinja --ctx-size 16384 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio

Still getting my setup dialed in, but this is working for now.

Edit: Oh, yeah, you had asked about Qwen3 Coder Next. That command was:

  podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable \
    --rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2  -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q6_K_XL \
    --jinja --ctx-size 262144 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio

(as mentioned, still just getting this set up so I've been moving around between using `-hf` to pull directly from HuggingFace vs. using `uvx hf download` in advance, sorry that these commands are a bit messy, the problem with using `-hf` in llama.cpp is that you'll sometimes get surprise updates where it has to download many gigabytes before starting up)

nyrikki 116 days ago

I can't answer for the OP but it works fine under llama.cpp's container.

consp 116 days ago

20k for such a setup for a hobbyist? You can leave the somewhat away and go into sub 1% region globally. A kw of power is still 2k/year at least for me, not that I expect it will run continuously but still not negligible if you can do with 100-200 a year on cheap subscriptions.

dec0dedab0de 116 days ago

There are plenty of normal people with hobbies that cost much more. Off the top of my head, recreational vehicles like racecars and motorcycles, but im sure there are others.

You might be correct when you say the global 1%, but that's still 83 million people.

markb139 116 days ago

I used to think photography was an expensive hobby until my wife got back into the horse world.

simonw 116 days ago

"a (somewhat wealthy) hobbyist"

manwe150 116 days ago

Reminder to others that $20k is the one time startup cost, and is amortized perhaps 2-4k/year (plus power). That is in the realm of a mere gym membership around me for a family

vuggamie 116 days ago

So 5-10 years to amortize the cost. You could get 10 years of Claude Max and your $20k could stay in the bank in case the robots steal your job or you need to take an ambulance ride in the US.

blibble 116 days ago

> And it's hard to imagine that the hardware costs don't come down quite a bit.

have you paid any attention to the hardware situation over the last year?

this week they've bought up the 2026 supply of disks

newsoftheday 116 days ago

> a cost that a (somewhat wealthy) hobbyist can afford

$20,000 is a lot to drop on a hobby. We're probably talking less than 10%, maybe less than 5% of all hobbyists could afford that.

charcircuit 116 days ago

You can rent computer from someone else to majorly reduce the spend. If you just pay for tokens it will be cheaper than buying the entire computer outright.

xboxnolifes 116 days ago

Up front, yeah. But people with hobbies on the more expensive end can definitely put out 4k a year. Im thinking like people who have a workshop and like to buy new tools and start projects.

lm28469 116 days ago

90% of companies would go bankrupt in a year if you replaced their engineering team with execs talking to k2...

trentnix 116 days ago

Most execs I've worked with couldn't tell their engineering team what they wanted with any specificity. That won't magically get any better when they talk to an LLM.

If you can't write requirements an engineering team can use, you won't be able to write requirements for the robots either.

msp26 116 days ago

Horrific comparison point. LLM inference is way more expensive locally for single users than running batch inference at scale in a datacenter on actual GPUs/TPUs.

AlexandrB 116 days ago

How is that horrific? It sets an upper bound on the cost, which turns out to be not very high.

PlatoIsADisease 116 days ago

>24 tokens/second

this is marketing not reality.

Get a few lines of code and it becomes unusable.

qaq 116 days ago

If I remember correctly Dario had claimed that AI inference gross profit margins are 40%-50%

gjk3 116 days ago

Why do you people trust what he has to say? Like omg dude. These folks play with numbers all the time to suit their narrative. They are not independently audited. What do you think scares them about going public? Things like this. They cannot massage the numbers the same way they do in the private market.

The naivete on here is crazy tbh.

qaq 116 days ago

Pretty poor narrative tbh. As things stand they will not be profitable unless stop developing new models or get to AGI. So very likely never.