Hacker News new | ask | show | jobs
by simonw 356 days ago
"In a move that at first went unnoticed, Google significantly increased the price of its popular Gemini 2.5 Flash model"

It's not quite that simple. Gemini 2.5 Flash previously had two prices, depending on if you enabled "thinking" mode or not. The new 2.5 Flash has just a single price, which is a lot more if you were using the non-thinking mode and may be slightly less for thinking mode.

Another way to think about this is that they retired their Gemini 2.5 Flash non-thinking model entirely, and changed the price of their Gemini 2.5 Flash thinking model from $0.15/m input, $3.50/m output to $0.30/m input (more expensive) and $2.50/m output (less expensive).

Another minor nit-pick:

> For LLM providers, API calls cost them quadratically in throughput as sequence length increases. However, API providers price their services linearly, meaning that there is a fixed cost to the end consumer for every unit of input or output token they use.

That's mostly true, but not entirely: Gemini 2.5 Pro (but oddly not Gemini 2.5 Flash) charges a higher rate for inputs over 200,000 tokens. Gemini 1.5 also had a higher rate for >128,000 tokens. As a result I treat those as separate models on my pricing table on https://www.llm-prices.com

One last one:

> o3 is a completely different class of model. It is at the frontier of intelligence, whereas Flash is meant to be a workhorse. Consequently, there is more room for optimization that isn’t available in Flash’s case, such as more room for pruning, distillation, etc.

OpenAI are on the record that the o3 optimizations were not through model changes such as pruning or distillation. This is backed up by independent benchmarks that find the performance of the new o3 matches the previous one: https://twitter.com/arcprize/status/1932836756791177316

2 comments

Both great points, but more or less speak to the same root cause - customer usage patterns are becoming more of a driver for pricing than underlying technology improvements. If so, we likely have hit a "soft" floor for now on pricing. Do you not see it this way?
Even given how much prices have decreased over the past 3 years I think there's still room for them to keep going down. I expect there remain a whole lot of optimizations that have not yet been discovered, in both software and hardware.

That 80% drop in o3 was only a few weeks ago!

No doubt prices will continue to drop! We just don't think it will be anything like the orders-of-magnitude YoY improvements we're used to seeing. Consequently, developers shouldn't expect the cost of building and scaling AI applications to be anything close to "free" in the near future as many suspect.
I do not see it this way. Google is a publicly traded company responsible for creating value for their shareholders. When they became dicks about ad blockers on youtube last year or so, was it because they hit a bandwidth Moore's law? No. It was a money grab.

ChatGPT is simply what Google should've been 5-7 years ago, but Google was more interested in presenting me with ads to click on instead of helping me find what I was looking for. ChatGPT is at least 50% of my searches now. And they're losing revenue because of that.

I really hate the thinking. I do my best to disable it but don't always remember. So often it just gets into a loop second guessing itself until it hits the token limit. It's rare it figures anything out while it's thinking too but maybe that's because I'm better at writing prompts.
I have the impression that the thinking helps even if the actual content of the thinking output is nonsense. It awards more cycles to the model to think about the problem.
That would be strange. There's no hidden memory or data channel, the "thinking" output is all the model receives afterwards. If it's all nonsense, then nonsense is all it gets. I wouldn't be completely surprised if a context with a bunch of apparent nonsense still helps somehow, LLMs are weird, but it would be odd.
This isn't quite right. Even when an LLM generates meaningless tokens, its internal state continues to evolve. Each new token triggers a fresh pass through the network, with attention over the KV cache, allowing the model to refine its contextual representation. The specific tokens may be gibberish, but the underlying computation can still reflect ongoing "thinking".
Attention operates entirely on hidden memory, in the sense that it usually isn't exposed to the end user. An attention head on one thinking token can attend to one thing and the same attention head on the next thinking token can attend to something entirely different, and the next layer can combine the two values, maybe on the second thinking token, maybe much later. So even nonsense filler can create space for intermediate computation to happen.
Wasn't there some study that just telling the LLM to write a bunch of periods first improves responses?
There are several such papers, off the top of my head one is https://arxiv.org/abs/2404.15758

It's a bit more subtle though, if I understand correctly this only works for parallelizable problems. Which makes intuitive sense since the model cannot pass information along with each dot. So in that sense COT can be seen as some form of sampling, which also tracks with findings that COT doesn't boost the "raw intelligence" but rather uncovers latent intelligence, converting pass@k to maj@k. Antirez touches upon this in [1].

On the other hand, I think problems with serial dependencies require "real" COT since the model needs to track the results of subproblems. There's also some studies which show a meta-structure to the COT itself though, e.g. if you look at DeepSeek there are clear patterns of backtracking and such that are slightly more advanced than naive repeated samplings. https://arxiv.org/abs/2506.19143

[1] https://news.ycombinator.com/item?id=44288049

Although thinking a bit more, even constrained to only output dots, there can still some amount of information passing between each token, namely in the hidden states. The attention block N layers deep will compute attention scores off of the residual stream for previous inputs at that layer, so some information can be passed along this way.

It's not very efficient though, because for token i layer N can only receive as input layer N-1 for tokens i-1, i-2... So information is sort of passed along diagonally. If handwavily the embedding represents some "partial result" then it can be passed along diagonally from (N-1, i-1) to (N, i) to have the COT for token i+1 continue to work on it. So this way even though the total circuit depth is still bounded by # of layers, it's clearly "more powerful" than just naively going from layer 1...n, because during the other steps you can maybe work on something else.

But it's still not as powerful as allowing the results at layer n to be fed back in, which effectively unrolls the depth. This maybe intuitively justifies the results in the paper (I think it also has some connection to communication complexity).

Eh. The embeddings themselves could act like hidden layer activations and encode some useful information.
It's almost like there's an incentive for them to burn as many tokens as possible accomplishing nothing useful.
I hate thinking mode because I prefer a mostly right answer right now over having to wait for a probably better, but still not exactly right answer.