As if inference exists in a bubble. Driving a car from point A to point B costs $0, as long as you exclude the cost of the car or the fuel you purchased before you were at point A.
I believe that; equally, it's so unverifiable that it's a point of faith.
I'm not suggesting it's an outright lie, but rather that it's easy to massage the costs to make it look true even if it isn't. E.g., does GPU cost go into inference cost or not?
I can run quite useful models on my PC. It might not change the world, but I got a usable transcript of an old foreign-language TV show and then machine-translated it to English. It is not as good as professional subtitles, but I wasn't willing to pay the cost of that option.
I did something similar with Whisper a year or so ago.
9 years ago, when my now wife and I were dating, we took a long cross-country road trip, and for a lot of it, we listened to NPR's Ask Me Another (a comedy trivia game).
Anyway, on one random episode, there was a joke in the show that just perfectly fit what we were doing at that exact moment. We laughed and laughed and soon forgot about it.
Years later, I wanted to find that again and purposely recreate the same moment.
I downloaded all 300 episodes as MP3s. I used Whisper to generate text transcripts, followed by a little bit of grepping, and I found the one 4-second joke that otherwise would have been lost to time.
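For anyone wanting to try the same transcribe-then-grep workflow, here is a rough sketch. It is not the script I actually ran; the directory layout, file names, and function names are made up for illustration, and it assumes the openai-whisper package, using its `whisper.load_model` and `model.transcribe` calls.

```python
import re
from pathlib import Path


def transcribe_all(mp3_dir: Path, out_dir: Path, model_name: str = "base") -> None:
    """Write one .txt transcript per MP3 (assumes `pip install openai-whisper`)."""
    import whisper  # imported lazily so the search helper below works without it

    model = whisper.load_model(model_name)
    out_dir.mkdir(parents=True, exist_ok=True)
    for mp3 in sorted(mp3_dir.glob("*.mp3")):
        target = out_dir / (mp3.stem + ".txt")
        if target.exists():  # skip finished files so a long run can be resumed
            continue
        result = model.transcribe(str(mp3))
        # One line per segment: start time in seconds, a tab, then the text.
        lines = [f"{seg['start']:.1f}\t{seg['text'].strip()}" for seg in result["segments"]]
        target.write_text("\n".join(lines), encoding="utf-8")


def grep_transcripts(out_dir: Path, pattern: str) -> list[tuple[str, str]]:
    """Return (filename, line) pairs whose text matches the regex, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for txt in sorted(out_dir.glob("*.txt")):
        for line in txt.read_text(encoding="utf-8").splitlines():
            if rx.search(line):
                hits.append((txt.name, line))
    return hits
```

Keeping the segment start time on each line means a hit tells you roughly where in the episode to listen, not just which episode it was.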
I downloaded 2GiB of data and let a script run for 56 hours. Besides a bit of my time, which I found to be enjoyable, it didn't cost me anything.
Maybe you could argue it cost some electricity, but in reality it meant my computer, which runs 24/7 pulling ~185W, was running at ~300W for 56 hours. Thus: (300W - 185W) * 56h = 6.44kWh, and 6.44kWh @ $0.13 per kWh is about $0.84, call it $0.85 + tax.
So... Yes, it was very much worth $0.85 to make my wife happy.
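The electricity arithmetic above, as a small sketch so you can plug in your own numbers. The function name is made up; the inputs are the figures from this comment.

```python
def marginal_energy_cost(idle_w: float, load_w: float, hours: float, usd_per_kwh: float):
    """Return (extra kWh, extra USD) for running at load_w instead of idle_w."""
    extra_kwh = (load_w - idle_w) * hours / 1000  # watt-hours -> kilowatt-hours
    return extra_kwh, extra_kwh * usd_per_kwh


kwh, usd = marginal_energy_cost(idle_w=185, load_w=300, hours=56, usd_per_kwh=0.13)
print(f"{kwh:.2f} kWh, ${usd:.2f} before tax")  # 6.44 kWh, $0.84 before tax
```

This only counts the draw above idle, which is fair for a machine that would have been on anyway.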
It's a little bit more complicated than that if you were running a business.
You would want to add the cost of your network and hardware depreciating over the timeframe, and you probably can't just ignore the first 185W: if you are Anthropic, that idle power draw presumably wouldn't be needed at all if you weren't expecting to serve AI traffic.
So, let's say $0.02 per hour ($1/50 roughly). That's about $15 per month per user. Let's call it $10 per month per user since users aren't constantly hammering the service. To support a big sales and marketing engine, you would like to be selling subscriptions for $100+ per month. I'm just not sure people are prepared to pay that for AI in its current form.
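A quick sanity check of the per-user figure, as a sketch. The 30-day month is my assumption, and the utilization value is simply what the $10 round-down would imply rather than a measured number.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, service available around the clock


def monthly_cost(usd_per_hour: float, utilization: float = 1.0) -> float:
    """Monthly cost per user at a given hourly rate and fraction of time in use."""
    return usd_per_hour * HOURS_PER_MONTH * utilization


full_time = monthly_cost(0.02)
implied_utilization = 10 / full_time  # what rounding down to $10/month amounts to
print(f"${full_time:.2f} per month at 100% use")
print(f"$10 per month implies about {implied_utilization:.0%} utilization")
```

Real users are probably active far less than that, so $10 is likely a conservative estimate of the raw compute cost at this hourly rate.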
"we will be left with local models that can be sort of useful but also sort of sucks" is not really a great proposition for the obscene amount of money being invested in this.
Won’t those models gradually become outdated (for anything related to events that happen after the model was trained, new programming languages or framework versions, etc.) if no one is around to continually re-train them?
There's a gazillion use cases for these things in business that aren't even beginning to be tapped yet. Demand for tokens should be practically unlimited for many years to come. Some of those ideas won't be financially viable but a lot will.
Consider how much software is out there that can now be translated into every (human) language continuously, opening up new customers and markets that were previously being ignored due to the logistical complexity and cost of hiring human translation teams. Inferencing that stuff is a no-brainer, but there's a lot of workflow and integration needed first, which takes time.
Perhaps surprisingly, considering the current stratospheric prices of GPUs, the performance-per-dollar of compute is still rising faster than exponentially. In a handful of years it will be cheap to train something as powerful as the models that cost millions to train today. Algorithmic efficiencies also stack up and make it cheaper to build and serve older models even on the same hardware.
It’s underappreciated that we would already be in a pretty absurdly wild tech trajectory just due to compute hyperabundance even without AI.
Not at all, otherwise models with knowledge cutoffs of six months to a year ago (all current SOTA models) would be useless. Current information is fed into the model as part of the prompt. This is why they use web search.
The main reason they train new models is to make them bigger and better using the latest training techniques, not to update them with the latest knowledge.
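To make "fed into the model as part of the prompt" concrete, here is a minimal sketch. The function name and prompt wording are invented for illustration, and the snippets stand in for whatever a web search returns; no vendor's actual prompt format is implied.

```python
from datetime import date


def build_prompt(question: str, snippets: list[str], today: date) -> str:
    """Assemble a prompt that puts fresh search results ahead of the question."""
    # Number the snippets so the model can refer back to them.
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return (
        f"Today's date: {today.isoformat()}\n"
        "The search results below may be newer than your training data.\n\n"
        f"Search results:\n{context}\n\n"
        f"Question: {question}\n"
    )


prompt = build_prompt(
    "What changed in the latest release?",
    ["Placeholder snippet from a release announcement.", "Placeholder snippet from a changelog."],
    date(2025, 1, 15),
)
print(prompt)
```

The model's weights never change here; only the text it is handed does, which is why a months-old knowledge cutoff is workable.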
I'm trying to avoid getting into the habit of asking LLMs about current events, or really any events. Or really facts at all.
I think LLMs work best when you give them data and ask them to try to make sense of it, or find something interesting, or spot some problem: to see something I can't see. Then I can go back to the original data and make sure it's true.
There are a number of techniques to modify a model post-training. Some of those techniques allow adding current events to the model's "knowledge" without having to do an entire from-scratch training run, saving money.
The models get more efficient every year and consumer chips get more capable every year. A GPT-5 level model will be on every phone running locally in 5 years.
I run models for coding on my own machines. They’re a trivial expense compared to what I earn from the work I do.
The “at a loss” scenario comes from (1) training costs and (2) companies selling tokens below market to get market share. Neither of those implies that people won’t run models in the future. Training new frontier-class models could potentially become an issue, but even that seems unlikely given what these models are capable of.
It's unclear if people would pay the price to use them if they were not below market.
I have access to quite a few models, and I use them here and there. They are sort of useful, sometimes. But I don't pay directly for any of them. Honestly, I wouldn't.
Ok, running them locally, that's definitely a thing.
But then, without this huge financial and tech bubble that's driven by these huge companies:
1/ will those models evolve, or new models appear, for a fraction of the cost of building them today?
2/ will GPUs (or their replacements) also cost a fraction of what they cost today, so that they are still integrated in end-user processors and those models can run efficiently?
Given the popularity, activity, and pace of innovation seen on /r/LocalLLaMa, I do think models will keep improving, though likely not at the same pace as today. Those people love tinkering: it's mostly enthusiasts with a budget for a fancy setup in a garage, independent researchers, and smaller businesses doing research there.
These people won't sit still and models will keep getting better as well as cheaper to run.
No one on LocalLlama is training their own models. They’re working with foundation models like Llama from Meta and tweaking them in various ways: fine-tuning, quantizing, RAG, etc. There’s a limit to how much improvement can be made like that. The basic capabilities of the foundation model still constrain what’s possible.