| HN Mirror


by AlphaSite 115 days ago

I think for the same model wall time is probably a more intuitive metric; at the end of the day what you’re doing is renting GPU time slices.

Large outputs dominate compute time, so they are more expensive.

IMO input and output token counts are actually still a bad metric, since they linearise nonlinear cost increases. I suspect we’ll see another change in the future where providers bucket by context length: XL output contexts may be 20x more expensive instead of 10x.
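To make the difference concrete, here is a rough sketch of flat per-token pricing next to pricing bucketed by context length. Every rate, the 200k threshold, and the function names are made-up placeholders for illustration, not any provider's actual price list.

```python
# Hypothetical rates in dollars per million tokens; not real prices.
def linear_cost(input_tokens, output_tokens, in_rate=3.0, out_rate=15.0):
    """Flat per-token pricing: every token costs the same regardless of context size."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def bucketed_cost(input_tokens, output_tokens, threshold=200_000,
                  in_rates=(3.0, 6.0), out_rates=(15.0, 22.5)):
    """Bucketed pricing: the whole request moves to a higher rate tier
    once the input context exceeds the (assumed) threshold."""
    tier = 1 if input_tokens > threshold else 0
    return (input_tokens * in_rates[tier] + output_tokens * out_rates[tier]) / 1_000_000


# Below the threshold the two schemes agree; above it the bucketed price jumps.
small = (linear_cost(100_000, 10_000), bucketed_cost(100_000, 10_000))
large = (linear_cost(300_000, 10_000), bucketed_cost(300_000, 10_000))
```

A step function like this is still only a coarse approximation of a nonlinear cost curve, but it tracks it more closely than a single flat rate.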

3 comments

nomel 114 days ago

As a customer, it's nice that I can quantize and count the units of cost in an understandable way.

For Anthropic, as a business bleeding money, it's probably nice to have value-based pricing for tokens, so that innovation (like computational efficiency improvements) can result in some extra margin. If they exposed the more direct computation cost, they could never financially benefit from any improved efficiency, including faster hardware!


yencabulator 113 days ago

> I think for the same model wall time is probably a more intuitive metric; at the end of the day what you’re doing is renting GPU time slices

This is a bit too much of a simplification.

The LLM provider batches multiple customer requests into one GPU/TPU pass over the weights, with minimal latency increase.

The LLM provider may in fact be renting GPUs by the second, but the end user isn't. We the end users are essentially timesharing a pool of GPUs without any dedicated "1 vGPU" style resource allocation. In such a setting, charging by "GPU tick" sounds valid, and the various categories of token costs are an approximation of cost+margin.
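As a back-of-the-envelope illustration of why batching breaks the "renting a GPU time slice" picture, the sketch below splits the cost of one batched pass evenly across the requests in it. The GPU price, pass times, and batch size are invented numbers, and the even split is an assumption; real providers may attribute cost quite differently.

```python
def cost_per_request(gpu_price_per_second: float, pass_seconds: float, batch_size: int) -> float:
    """Split the cost of one batched pass evenly across the requests in it."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return gpu_price_per_second * pass_seconds / batch_size


# Hypothetical: batching 32 requests makes the pass only 5% slower,
# so wall time per request barely changes while the cost share drops sharply.
solo = cost_per_request(gpu_price_per_second=0.001, pass_seconds=1.0, batch_size=1)
batched = cost_per_request(gpu_price_per_second=0.001, pass_seconds=1.05, batch_size=32)
```

Under these assumptions each user sees nearly the same wall time in both cases, yet the provider's cost per request differs by roughly 30x, which is why wall time alone is a poor proxy for what a request costs to serve.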


nsomaru 115 days ago

They already bucket when context goes above 200k


refulgentis 114 days ago

No longer
