Hacker News
by spmurrayzzz 2 days ago
I don't think it's smoke and mirrors, though I do have plenty of gripes with how the labs market this product landscape.

The newest, biggest model can still matter even if you do not run every prompt through it. You'll always have some tasks where even a small loss in output quality is unacceptable, and for those you need to make sure frontier intelligence is used.

On the router point, yes, routing has some overhead. But the router does not need to run the biggest model to decide which model to use. We've been using tiny classifiers for recommendation engines for ages now, usually on CPU. If routing saves you from sending a large fraction of traffic to the expensive reasoning model, the routing overhead can easily be worth it.
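To make the router point concrete, here's a rough sketch of what a CPU-only router could look like. Everything in it is made up for illustration: the features, the weights, the threshold, and the model names `small-model` / `large-reasoning-model` are all hypothetical, and a real router would learn its weights from labeled traffic rather than hard-coding them.

```python
import math

# Hypothetical hand-set weights; a real router would learn these from labeled traffic.
WEIGHTS = {"length": 0.002, "has_code": 1.2, "has_math": 1.5, "multi_step": 1.0}
BIAS = -2.0


def features(prompt: str) -> dict[str, float]:
    """Cheap surface features that can be computed on CPU in microseconds."""
    lower = prompt.lower()
    return {
        "length": float(len(prompt)),
        "has_code": float("def " in prompt or "import " in prompt),
        "has_math": float(any(w in lower for w in ("prove", "integral", "theorem"))),
        "multi_step": float(any(w in lower for w in ("step by step", "refactor"))),
    }


def p_hard(prompt: str) -> float:
    """Logistic score: estimated probability the prompt needs the big model."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features(prompt).items())
    return 1.0 / (1.0 + math.exp(-z))


def route(prompt: str, threshold: float = 0.5) -> str:
    """Pick a (hypothetical) model name based on the classifier score."""
    return "large-reasoning-model" if p_hard(prompt) >= threshold else "small-model"
```

With these made-up weights, `route("What is the capital of France?")` goes to the small model and `route("Prove the theorem step by step.")` goes to the large one. The point is only that the decision itself is a handful of string checks and one dot product, not a forward pass through anything large.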

> Also, if there is significant gains from caching, then like.. what are even doing here? Inputting something and then reading cached pieces of text based on their similarity to the input? Kinda like a search engine?

The caching I'm talking about is explicitly the attention/KV cache, so it's not input-similarity retrieval (that would be more like what you'd use in a RAG/IR system). Prompt caching is generally about reusing the already-computed key/value tensors for repeated prompt prefixes, so you don't recompute the same static system prompt, tool definitions, schemas, long shared context, or repeated boilerplate every time. In more sophisticated systems, you usually store multiple checkpoints so that a small prompt change doesn't result in an all-or-nothing hit/miss scenario.
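Here's a toy sketch of the checkpoint idea, assuming a fixed block size and exact-match hashing of token prefixes. The names `PrefixCache` and `BLOCK` and the string stand-ins for the KV tensors are hypothetical; this is not how any particular provider or inference server implements it, just the general shape of the lookup.

```python
import hashlib

BLOCK = 4  # hypothetical checkpoint granularity, in tokens


def _key(tokens: list[int]) -> str:
    """Hash an exact token prefix; lookups are exact-match, not similarity-based."""
    return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()


class PrefixCache:
    def __init__(self) -> None:
        # prefix hash -> stand-in for the KV tensors computed for that prefix
        self._store: dict[str, str] = {}

    def insert(self, tokens: list[int]) -> None:
        """Checkpoint the KV state at every full block boundary of the prompt."""
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self._store.setdefault(_key(tokens[:end]), f"kv[0:{end}]")

    def longest_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens can reuse cached KV state."""
        best = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if _key(tokens[:end]) not in self._store:
                break  # insert() stores every boundary, so longer prefixes miss too
            best = end
        return best
```

If you insert a 10-token prompt and then send a 12-token prompt that shares only the first 8 tokens, `longest_prefix` returns 8, so only the last 4 tokens need a fresh prefill. With a single whole-prompt cache entry, that same request would be a complete miss.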