It's super hard to know if those prices are reflective of the true cost.
Remember that leaderboard position is very important, and many leaderboards rank by perf/$. So, to push the share price up and top the leaderboards, the company might quote a loss-leading price that doesn't reflect its real costs, and maybe set quotas so people can't run up losses that are too big.
Here's an evil theory: perhaps they had to really increase the size of Gemini Flash ... because otherwise it would have been close to, or maybe even outperformed by, Qwen 3.6 27b or something like that.
It must have improved considerably since I tried the "3.5-flash-preview" a couple of months ago if all these claims in the presentations are true. When I tried it, it couldn't even make changes in a 200-line Python script without making major mistakes (like messing up argument order when calling functions).
flash beating the pro it was distilled from is suspicious, not surprising. distillation usually loses you something. if the smaller model is winning on agentic evals, the more likely read is that the evals weren't measuring agent quality in the first place. that's the bigger problem for builders, not which model to pick.