| HN Mirror


	by kalkin 62 days ago
	AFAICT this uses a token-counting API to count how many tokens are in the prompt, in two ways, so it's measuring the tokenizer change in isolation. Smarter models also sometimes produce shorter outputs and therefore fewer output tokens. That doesn't mean Opus 4.7 necessarily nets out cheaper (it might still be more expensive), but this comparison isn't really very useful.

4 comments

h14h 62 days ago

For some real data, Artificial Analysis reported that 4.6 (max) and 4.7 (max) used 160M tokens and 100M tokens to complete their benchmark suite, respectively:

https://artificialanalysis.ai/?intelligence-efficiency=intel...

Looking at their cost breakdown, while input cost rose by $800, output cost dropped by $1400. Granted, whether output offsets input will be very use-case dependent, and I imagine the delta is a lot closer at lower effort levels.
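
To make the arithmetic in this comment explicit, here is a minimal sketch that uses only the figures quoted above (160M vs 100M tokens, +$800 input, -$1400 output). The variable names are made up for illustration, and the numbers are as reported by the commenter, not independently checked.

```python
# Figures as quoted in the comment above; variable names are illustrative.
tokens_46_max = 160_000_000   # tokens used by 4.6 (max) on the benchmark suite
tokens_47_max = 100_000_000   # tokens used by 4.7 (max) on the same suite
input_cost_delta = 800        # USD: input cost rose by this much
output_cost_delta = -1400     # USD: output cost dropped by this much

# Net change in total cost, and 4.7's token usage relative to 4.6.
net_delta = input_cost_delta + output_cost_delta
token_ratio = tokens_47_max / tokens_46_max

print(f"net cost change: {net_delta:+d} USD")
print(f"4.7 used {token_ratio:.1%} of the tokens 4.6 used")
```

On these numbers the suite comes out $600 cheaper overall, with 4.7 using 62.5% of the tokens 4.6 did. That holds for this benchmark at max effort only.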

theptip 62 days ago

This is the right way of thinking end-to-end.

Tokenizer changes are one piece to understand for sure, but as you say, you need to evaluate $/task not $/token or #tokens/task alone.
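
A rough sketch of the $/task view, under loudly stated assumptions: the prices and token counts below are hypothetical, chosen only to show the shape of the calculation. The 30% input inflation and the 5x output-to-input price ratio are figures mentioned elsewhere in this thread; the 25% drop in output tokens is invented for the example.

```python
def cost_per_task(input_tokens, output_tokens, input_price, output_price):
    """Dollar cost of one task; prices are in USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical prices, not taken from any price list.
INPUT_PRICE = 5.0     # USD per 1M input tokens (assumed)
OUTPUT_PRICE = 25.0   # USD per 1M output tokens (assumed, 5x input)

# Hypothetical task under the old tokenizer.
old = cost_per_task(100_000, 40_000, INPUT_PRICE, OUTPUT_PRICE)

# Same task: 30% more input tokens, and (assumed) 25% fewer output tokens.
new = cost_per_task(130_000, 30_000, INPUT_PRICE, OUTPUT_PRICE)

print(f"old: ${old:.2f}/task, new: ${new:.2f}/task")
```

With these made-up inputs the task gets cheaper (1.50 to 1.40) even though the input side costs 30% more. Change the assumed output reduction and the conclusion can flip, which is why $/token alone does not settle the question.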

manmal 62 days ago

Why is it not useful? Input token pricing is the same for 4.7. The same prompt costs roughly 30% more now, for input.

dktp 62 days ago

The idea is that smarter models might use fewer turns to accomplish the same task - reducing the overall token usage

Though, from my limited testing, the new model is far more token hungry overall

manmal 62 days ago

Well, you'll need the same prompt for input tokens?

httgbgg 62 days ago

Only the first one. Ideally now there is no second prompt.

manmal 62 days ago

Are you aware that every tool call produces output which also counts as input to the LLM?

squeaky-clean 62 days ago

Are you aware that a lot of model tool calls are useless and a smarter model could avoid those?

Are you aware that output tokens are priced 5x higher than input tokens?
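
Both points in this exchange (tool output is billed again as input, and fewer turns mean less of it) can be put into one toy model. This is a sketch under simplifying assumptions, not how any particular provider bills: every turn resends the full history, every turn has the same output and tool-result size, prompt caching is ignored, and output is weighted 5x input as stated above. All function names and numbers are hypothetical.

```python
def agent_run_tokens(prompt_tokens, turns, output_per_turn, tool_result_per_turn):
    """Return (total_input, total_output) for a loop that resends the full context each turn."""
    context = prompt_tokens
    total_input = 0
    total_output = 0
    for _ in range(turns):
        total_input += context            # the whole history is billed as input
        total_output += output_per_turn   # the model's reply / tool call
        # The reply and the tool result both join the history for the next turn.
        context += output_per_turn + tool_result_per_turn
    return total_input, total_output

def relative_cost(total_input, total_output, output_multiplier=5):
    """Cost in units of one input token's price (output weighted 5x, per the comment)."""
    return total_input + output_multiplier * total_output

# Hypothetical baseline: 10 turns with the old tokenizer.
baseline = relative_cost(*agent_run_tokens(1000, 10, 200, 800))

# Hypothetical smarter model: every count inflated 30%, but only 4 turns needed.
smarter = relative_cost(*agent_run_tokens(1300, 4, 260, 1040))

print(baseline, smarter)
```

Because the context is resent every turn, input tokens grow roughly quadratically with the number of turns, so in this toy setup cutting turns from 10 to 4 outweighs a 30% tokenizer penalty. Whether a given model actually needs fewer turns is an empirical question the sketch cannot answer.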

httgbgg 61 days ago

This has no bearing on my comment. The point is that a better model avoids dozens of prompts and tool calls by making fewer CORRECT tool calls, with the user needing no more prompts.

I’m surprised this is even a question; obviously a better prompter has the same properties and it’s not in dispute?

kalkin 62 days ago

That's valid, but it's also worth knowing it's only one part of the puzzle. The submission title doesn't say "input".

SkyPuncher 62 days ago

Yes. I actually noticed my token usage go down on 4.6 when I started switching every session to max effort. I got work done faster with fewer steps because thinking corrected itself before it cycled.

I’ve noticed 4.7 cycling a lot more on basic tasks, though it also seems a bit better at holding long-running context.

the_gipsy 62 days ago

With AIs, it seems like there never is a comparison that is useful.

theptip 62 days ago

You can build evals. Look at Harbor or Inspect. It’s just more work than most are interested in doing right now.

jascha_eng 62 days ago

Yup, it's all vibes. And Anthropic is still winning on those in my book.
