| HN Mirror


by onlyrealcuzzo 20 days ago

The actual cost is going to drop 99% in ~4 years.

How much of that makes it into enterprise pricing is TBD, since none of the hyperscalers are making money yet off selling AI inference.

Almost all businesses are jumping the gun. For most of their use cases, AI is either not yet good enough on its own, or good enough but too expensive.

No one wants to get left behind, so everyone's trying to get onto it now, even though it's not ready for what most enterprises want to do with it.

It's easy for them to look at a small startup without billions of lines of legacy business logic debt, see it having success, and wonder why they can't have just as much, or more: they're bigger, so they should have better and more success, right???

Wrong...

But when it gets ~99% cheaper for local inference over the next 4 years, at the same time the price per watt improves 4x -> a lot of those cases will start to pencil out.

7 comments

BearOso 20 days ago

Going from Opus 4.5 to 4.7 secretly required 6x more compute to run. 4.8 is apparently 30% more on top. I haven't seen any optimizations lately aside from distillation. Nobody's optimizing, they're just scaling up.

rescbr 20 days ago

> Nobody's optimizing

The Chinese, since they lack computing hardware due to US export controls, are.

trollbridge 20 days ago

And our export controls are going to turn China into a winner in the AI arms race if we're not careful.

rented_mule 20 days ago

I retired a few years ago, but I still write a fair bit of code. I was using Copilot's code completion before I retired, but coding agents hadn't come around yet. I've been wanting to try them, but I kept putting it off, and now the price increases make it hard to justify.

So I just started trying CodeWhale (https://github.com/Hmbown/CodeWhale) with DeepSeek V4. I expected to be impressed by the abilities (which still require plenty of oversight). I didn't expect to be completely shocked by how cheap it is. After most of a week of using it 4-8 hours a day, which would amount to a full week of coding in many jobs after you account for non-coding activities, I'm about to hit $3 in total usage. So we're talking $10-20 per month for single-agent use by a full time software developer? And I'm sure some of my usage is waste as I'm still getting my head around things like compaction. If I take a break for a few weeks, I pay nothing because there is no subscription.
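As a rough check of that estimate, here is a minimal sketch that scales the roughly $3 per week figure above to a month. The 52/12 weeks-per-month factor and the function name are assumptions for illustration, not something from the comment.

```python
# Scale a weekly pay-per-use spend to an average calendar month.
# Assumption: an average month is 52 / 12 weeks long.
WEEKS_PER_MONTH = 52 / 12


def monthly_cost(weekly_usd: float) -> float:
    """Return the approximate monthly spend for a given weekly spend."""
    return weekly_usd * WEEKS_PER_MONTH


# About $3 for most of a week of 4-8 hours a day, as reported above.
estimate = monthly_cost(3.0)
print(f"${estimate:.2f} per month")
```

Under those assumptions the result lands inside the $10-20 per month range quoted above.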

If DeepSeek and Xiaomi MiMo stay within a few months of the US-based models in terms of capabilities and US companies don't figure out how to drastically cut prices, I can't see how China hasn't already won. Protectionism would be one reason, but that might be ceding 50-90% of the total addressable market, and bring us closer to moving knowledge work out of the US the same way we did with manufacturing because it's too expensive in the US.

zzleeper 20 days ago

Holy F.. $3 .. once I'm done with my base cursor allocation, each nontrivial question costs $5 . And yes, I'm now switching to a mix of codex and ds4pro

sgc 20 days ago

How are you using it? More to complete specific functions or scripts, or for larger architectural design and longer implementation runs?

rented_mule 19 days ago

My initial use was in a repo where I create models for 3d printing using a library called build123d. There are a handful of parametric models and then many instances of those models with parameters (one that's 24 mm in diameter with a cutout, another that's 42 mm in diameter but no cutout, etc.). I tend to be in a hurry when I want a new parametric model, so I've ended up just copying the one that's the most similar and changing what I want to be different.

The first big task was to find the common bits and abstract them out. It did a great job of creating a plan, summarized in a table, that gave a name to shared chunks, the line numbers in various files where they appeared, line counts of new functions vs. removed bits, and some pros/cons about splitting out each chunk. It was very well "thought out", so I told it to go ahead. It did a nice job other than straying from my coding conventions. That gave me a chance to build out my AGENTS.md file (it helped with that, too).

Once that was done, I had it create automated tests for the newly abstracted parts. I think this is probably a bad practice... I believe humans should at least define what the tests are testing so that there is a deeper understanding of what oversight is in place. But I was just trying things. It surprised me how well it did. The biggest surprise was that the tests seemed quite inspired by vision. It would try different parameters and then have comments about making sure the shape protruded in a certain way, then code that did that. I expected it to refactor a bunch of the code to make it more testable. It found a way to not touch the code while testing everything I asked it to with just two simple mocks - I hadn't foreseen that, but it felt quite practical. It was passing around several opaque tuples in the tests and accessing items in them by index. I prompted it to replace the first one with a frozen, kw-only dataclass. Then a second. On the second request, it saw the pattern and did the rest without me asking. It created 44 tests across a handful of files.

The next part is where I was the least happy. I use ruff and ty to check my code with almost all checks enabled. It was mostly good about the ruff issues. But for the type checking, it just wanted to disable 6-8 rules for the entire repo in pyproject.toml, or at least for all the tests. I had to repeatedly tell it not to and it kept telling me it wasn't recommended. When it finally gave in, it fixed most of the type issues (build123d has lots of types specified, but many operations result in type conflicts because things are so deeply overloaded). The things it didn't fix, it just left a comment to ignore type checking altogether on that line. After I did a little more brow beating, it finally changed the comments to only disable specific rules. To be fair, and unlike most of my other repos, I've had to spend way too much time getting types right in this repo myself.

My last task involved a small library management system for our little town library (tracking library cards, books, DVDs, check-outs/check-ins, etc.). I inherited it from someone who had built the entire web app out of bash/awk/troff scripts with the data in text files burdened by a lot of schema changes that he didn't really know how to deal with. I'm halfway through moving it to Python/FastAPI/SQLite. I asked it to do a security audit of the entire code base, both the newer parts and the old parts that are still in bash/awk/troff. It found everything I knew about and a few things I didn't know about. It made a decent assessment of the risks/impact of each issue. It also called out design decisions that were good security practices. One of the next big tasks will be to see how it does at continuing the migration - it has enough examples of how I've done it that I suspect it can do something fairly consistent with my thinking. I'll probably have it do one or two web pages. When I feel like it understands what I'm after, I'll tell it to use sub-agents to do the rest. I'll be very happy if I don't have to tease apart any more troff scripts that are generating PDF files!

trollbridge 20 days ago

DeepSeek and Alibaba would like to have a word.

whatthesmack 19 days ago

Hasn't everything DeepSeek and Alibaba created thus far been distilled from the results of many, many accounts logging into Claude and ChatGPT? And that's why there's so much bot detection now at US frontier labs? Doesn't that make the Chinese labs dependent until some unknown point in the future on advancements of US frontier labs? While what they currently provide is cheap, it seems like it's artificially cheap and somewhat static because they took others' intellectual property (no comment needed about US frontier labs stealing the world's knowledge... that's a separate topic).

NekkoDroid 19 days ago

> Hasn't everything DeepSeek and Alibaba created thus far been distilled from the results of many, many accounts logging into Claude and ChatGPT?

I doubt it is really any different to what the US labs do [1]. I never really bought the "they were basically all just distilling from us" shtick from Anthropic, I just assumed they were either comparing or also creating training data as basically any lab is doing.

[1]: https://www.reddit.com/r/ClaudeCode/comments/1tqaist/opus_48...

krona 20 days ago

> The actual cost is going to drop 99%

Do you mean the marginal cost to the producer, or the cost to the consumer? I can't see the price of electricity falling much, and the demand curve is apparently exponential if the hype is to be believed.

trollbridge 20 days ago

DeepSeek V4 Pro is 99% cheaper than similarly performing models were 2 years ago (if such a model even existed).

Computing has always been about how to wring out more efficiency. The ENIAC was 150,000 watts, with 3 phase 240 volt power, and cost about $500,000.

My day to day laptop (a year old) is 35 watts, with 1 phase 20 volt power, and cost $1,000, so that's 99.98% less power consumption, 99.8% cheaper, and it has about 10 orders of magnitude more computing power, all over a time span of 80 years.
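The two percentages can be reproduced from the figures given. This is a minimal sketch using only the numbers in the comment; the prices are taken as stated, with no inflation adjustment, and the helper name is made up for illustration.

```python
def percent_reduction(old: float, new: float) -> float:
    """Return how much smaller `new` is than `old`, as a percentage."""
    return (1 - new / old) * 100


# ENIAC vs. the laptop, using the wattage and price quoted above.
power = percent_reduction(150_000, 35)    # watts
price = percent_reduction(500_000, 1_000)  # dollars, as stated

print(f"power: {power:.2f}% less")
print(f"price: {price:.1f}% cheaper")
```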

cratermoon 20 days ago

Moore’s law is dead.

HappMacDonald 20 days ago

It died before AI came around and today's coding agents are somewhere upwards of twice as competent as whatever the state of the art of automatic coding was in 2020. 8I

mrandish 20 days ago

A good chunk of that was one-time gains from shifting GPU and memory architectures to better match what LLMs need at scale as well as some algorithmic improvements. Most of the low-hanging architecture optimization has already been harvested. We'll certainly have more algorithmic gains but the consensus is they'll generally be smaller and less frequent.

There's always a chance we'll have some dramatic gains far larger than DeepSeek's optimizations a year ago, but it hasn't happened again yet at even that scale. It would be nice but I certainly wouldn't count on it.

packetlost 20 days ago

I don't see how this is even remotely true. Unless there's some super breakthrough into a fundamentally different architecture, there's not really a path to a 50% reduction in price, much less a 99% reduction.

kilroy123 20 days ago

In fairness, I think _current_ capabilities will be cheaper. So the models of today will be run drastically cheaper in 4 years.

onlyrealcuzzo 20 days ago

And yet 90% drops for the same level of quality every 18 months have happened like clockwork...

And the technology already exists on the algorithmic front TODAY to lock in another 10x gain -> when, typically, algorithmic gains only account for ~30% of that drop and the other ~70% comes from better data (often synthetic) and knowledge distillation from frontier models.

Just look at DeepSeek's pricing...
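To see what that trend implies if it simply continued, here is a minimal sketch that compounds a 90% drop every 18 months over 4 years. It assumes the rate holds steady, which is exactly the point under dispute in this thread; the function name is illustrative.

```python
def remaining_fraction(drop_per_period: float,
                       period_months: float,
                       horizon_months: float) -> float:
    """Fraction of the original price left after compounding the drop."""
    return (1 - drop_per_period) ** (horizon_months / period_months)


# 90% cheaper every 18 months, carried forward for 48 months.
four_year = remaining_fraction(0.90, 18, 48)
print(f"{(1 - four_year) * 100:.2f}% cheaper after 4 years")

# Conversely: the steady yearly drop needed to reach 99% cheaper in 4 years.
yearly_drop_needed = 1 - 0.01 ** (1 / 4)
print(f"{yearly_drop_needed * 100:.1f}% cheaper per year needed for 99% in 4 years")
```

So the 99% figure is roughly what the claimed historical rate gives over 4 years; whether that rate continues is a separate question.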

datakan 20 days ago

What makes you think prices will drop? Everyone I’ve spoken to believes they will only skyrocket. Genuinely curious

onlyrealcuzzo 20 days ago

The technology already exists now on the algorithmic front for the next 10x drop between everyone adopting DeepSeek's MLA, MoE (mostly already done), Medusa (a better version of Google's speculative decoding), Kimi's Attn Residuals, and MiMo's Sliding Window Attn, and (possibly) Microsoft's 1.58-bit (this may be a nothing burger).

Historically, every 18 months, the price for the same level of quality has gone down 90%.

See: https://www.reddit.com/r/LocalLLaMA/comments/1gpr2p4/llms_co...

And Chart 13 here: https://www.rdworldonline.com/ais-great-compression-20-chart...

And here: https://epoch.ai/data-insights/llm-inference-price-trends

Historically, algorithmic gains are only ~30% of the pie, but there's enough out there to get to 10x, with just what's available already. The other ~70% of the pie is better training data (often synthetic) and distilling frontier knowledge. There's no sign we are tapped out on that front.

Additionally, GRAM (from ~10 days ago) is likely to be a 5-10x on its own (if not substantially more for smaller models). It's unlikely within 4 years LeCun's JEPA ideas and similar ideas like GRAM applied to LLMs have ZERO impact. The preliminary results are absolutely astounding (5000x better reasoning - this is not peanuts).

Further, that's not even counting that cost per watt is still dropping ~2x every 2 years on its own on the hardware front.

Look at the "cost" of inference: people think it's electricity, but it's currently ~80% hardware amortization. The memory shortage is not going to last, nor are Nvidia's ~80-90% margins.

The human brain is still 8-10 orders of magnitude more efficient than the best LLMs of today. With ~1/10th of global capex riding on AI, if you don't think they're going to knock off 2 orders of magnitude more, when it's this obvious and easy... I don't know what to tell you...

Sure, it might take 6 years instead of 4. My crystal ball isn't perfect.
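A rough way to see how the pieces above would stack up is to multiply them. This is a minimal sketch that treats the claimed 10x algorithmic gain and the ~2x every 2 years hardware gain as independent, multiplicative factors, which is an assumption; the other items in the comment (better data, GRAM, margins) are left out because no single figure is given for them.

```python
import math

TARGET_FACTOR = 100.0  # a 99% price drop means a 100x cheaper price

# Factors taken from the comment above; combining them by multiplication
# is an assumption made for illustration.
factors = {
    "algorithmic (claimed)": 10.0,
    "hardware, 2x every 2 years over 4 years": 2.0 ** (4 / 2),
}

combined = math.prod(factors.values())
drop_percent = (1 - 1 / combined) * 100
residual = TARGET_FACTOR / combined

print(f"combined: {combined:.0f}x, i.e. {drop_percent:.1f}% cheaper")
print(f"still needed from other sources to reach 99%: {residual:.1f}x")
```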

HarHarVeryFunny 20 days ago

Sure, the price will come down a lot, even if we can argue about the timeline.

I think what will also happen, once we get past this current CEO AI FOMO mania, is that companies will start to look at AI spending more rationally like any other company expense, and will revert to more rational decision making.

Even if the cost comes down considerably over the next few years, that's plenty of time for companies to look at their financial results and question why AI expenditure isn't resulting in an increase in revenue and/or profitability.

datakan 20 days ago

This is great food for thought, thank you

onlyrealcuzzo 20 days ago

Additionally, on the context front -> all the labs are aware that for many tasks you can get 10x+ increases in output quality by feeding better context.

See https://arxiv.org/abs/2604.04364.

This won't really show up in benchmarks, but it will impact real world usage on the most common use cases.

I'm doing a study right now on the impacts of better context for small models to fix bugs.

A very dumb algorithm can make small models perform at 10x+ model sizes. I'll be surprised if it can't get to 20x+

Nimitz14 20 days ago

This is mostly slop. But you may be directionally correct

rednb 20 days ago

I didn't take you seriously initially but after reading this, I think you are the real deal.

Thank you for sharing this and for having the intellectual courage to hold to a sound reasoning that may be unpopular initially.

bakugo 20 days ago

Prices have been very obviously trending up, not down. Even open weights models are becoming more expensive with every release. Computer hardware is ballooning in price.

onlyrealcuzzo 20 days ago

Prices are going up for BETTER quality -> not for the SAME level of quality.

People are willing to pay more for BETTER quality.

You obviously haven't seen DeepSeek v4 Pro's pricing if you think pricing only goes up...

bakugo 19 days ago

Maybe so, but that becomes irrelevant when you consider that the new, better quality instantly becomes the expected baseline. So the price of the "baseline" quality is going up regardless.

Let's look at GPU prices as an example. Around 12 years ago, I bought a GTX 970 for around $350. That was considered a very good GPU at the time. Today, the "equivalent" GPU model (RTX 5070) now costs almost double. Of course, the newer GPU is much more powerful (more than double, in fact), but all the things you'd use a GPU for have also advanced and now expect an entirely new level of performance as a baseline, such that the older GPU is fairly worthless today. So most people agree that GPUs in general have become more expensive.

Regarding DeepSeek's price: it's obviously subsidized, and unlikely to match the actual inference cost right now.

abalashov 20 days ago

Just wait for the next model and the next model architecture. Just wait for it, bro.

onlyrealcuzzo 20 days ago

Gemini 3.5 flash is 25% cheaper than 3.1 pro, and outperforms it on almost every benchmark, most by a pretty wide margin...

Rebelgecko 19 days ago

It's still 5x more expensive than 2.5 flash

abalashov 20 days ago

Cool.

bigstrat2003 20 days ago

There has never yet been a new model which actually improved over the previous ones. They suck just as much, and in the same ways, as the models of 3 years ago.

trollbridge 20 days ago

Grab a 5090 and run Qwen 3.6 35b on it (6 parameter seems to work best for me).

Then buy $10 (or $2, if you're cheap, and they take PayPal) of DeepSeek credits.

Whilst you're at it spring for a Claude subscription too and GPT.

Switch models between Qwen, DeepSeek Flash, DeepSeek Pro, and you can meet 99% of your code generation needs.

Hop over to Opus 4.7 (or 4.8, but I haven't really used it yet) and GPT-5.5 when doing very complex architecture/design or troubleshooting something where DeepSeek Pro is getting stuck.

It is ridiculous how cheap this stuff is now. It's affordable at third world prices.

Supermancho 20 days ago

None of that is cheap.

> spring for a Claude subscription too and GPT.

You started with some random pricing then veered off into impractical hand waving. Far above third world prices...unless you count the USA as third world, I guess.

trollbridge 19 days ago

The extra subscriptions are optional. You can do nearly all of it with just a DeepSeek subscription and switch between Flash and Pro.

If you have the $$, do the extra stuff. People who like to play video games often have a very fancy graphics card that sits idle during their work day.

AllegedAlec 20 days ago

> The actual cost is going to drop 99% in ~4 years.

And fusion power is just 2 decades into the future!

jjav 20 days ago

Full self driving guaranteed here before the end of the year (every year).

mrandish 20 days ago

> The actual cost is going to drop 99% in ~4 years.

We have little visibility into current frontier model costs at mass scale. As a broad historical trend, tech costs tend to fall over longer time periods but your claim far exceeds Moore's Law rates in its heyday - and that heyday is long gone.

In 2021 TSMC announced it was increasing its price per gate for new nodes for the first time in its history. In the past five years cutting edge nodes have delivered ~8-15% real-world performance gains on average at costs at least 10-20% more than the last node. If you're positing a string of unprecedented efficiency breakthroughs in LLM algorithms - such extraordinary claims require extraordinary evidence.
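Put as performance per dollar, those node figures work out to roughly flat. A minimal sketch, using only the ranges quoted above and pairing them at the extremes (the pairing and the function name are assumptions for illustration):

```python
def perf_per_dollar_change(perf_gain: float, cost_increase: float) -> float:
    """Relative change in performance per dollar from one node to the next."""
    return (1 + perf_gain) / (1 + cost_increase) - 1


# Least favorable pairing: smallest gain at the largest cost increase.
worst = perf_per_dollar_change(0.08, 0.20)
# Most favorable pairing: largest gain at the smallest cost increase.
best = perf_per_dollar_change(0.15, 0.10)

print(f"perf per dollar per node: {worst:+.1%} to {best:+.1%}")
```

Since the cost increases are quoted as "at least" 10-20%, the favorable end of that range is an upper bound.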