Hacker News new | ask | show | jobs
by zozbot234 56 days ago
It's not even a fixed cost per token (even though it's billed that way, and that's still miles better than a fixed-price all you can eat). You're incurring a cost that's proportional to generated tokens times the context for each (plus the prefill cost for any uncached input), so the expense grows quadratically with your average generated context.

This all becomes extremely visible when trying to do agentic coding with local language models - you quickly realize that controlling context length and model size is just as important as avoiding wasted effort. The real scam is not AI Q&A ala ChatGPT, that's actually quite viable - though marginally less so as conversations grow longer. It's agentic coding with SOTA models and huge contexts.

1 comments

Using larger contexts often costs more in the APIs or consume more of your quota but this is becoming less of a problem with models using more clever attention mechanisms and not just full attention on all layers.

You can look at: https://sebastianraschka.com/llm-architecture-gallery/ and see how much things have changed.

This is also something of a non issue because as context grows and attention gets diluted, the models perform worse. It'll cost Anthropic more to run your 900k context session, yes, but it's in your interest not to have a 900k session in the first place.
You’re right about performance degradation, but good luck trying to sell that as a product.

You can drive this car, but the last mile of this trip will use as much gas as the first 20 miles.

I think it’s in anthropics interest to keep this fact hidden from CEOs who push for ai adoption.