| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by jasonsb 245 days ago
	> Typical savings: 60-90% on most requests, since Gemini Flash is often free/cheapest, but you still get Claude or GPT-4 when needed. This claim seems overstated. Accurately routing arbitrary prompts to the cheapest viable model is a hard problem. If it were reliably solvable, it would fundamentally disrupt the pricing models of OpenAI and Anthropic. In practice, you'd either sacrifice quality on edge cases or end up re-running failed requests on pricier models anyway, eating into those "savings".

1 comments

moduspol 245 days ago

I genuinely wonder the use cases are where the required accuracy is so low (or I guess the prompts are so strong) that you don't need to vigorously use evals to prevent regressions with the model that works best--let alone actually just change models on the fly based on what's cheaper.

link

growt 245 days ago

Yes and in addition for some reason that use case is also not a fit for some cheap OS model like qwen or kimi, but must be run on the cheapest model of the big three.

link