| HN Mirror


	by wruza 436 days ago
	Is there a well-known benchmark for this? I don't feel that short vs long answers make any difference, but of course feelings aren't something we can measure. Also, if that works, why don't Copilot/Cursor write lots of excessive code mixed with lots of prose, only to distill it later?

1 comment

teruakohatu 436 days ago

> don't feel that short vs long answers make any difference

The “thinking” models are really verbose output models that summarise the thinking at the end. These tend to outperform non-thinking models, but at a higher cost.

Anthropic lets you see some/all of the thinking so you can see how the model arrived at the answer.

wruza 436 days ago

So if I replace "answer" with "summarize" that should work then?
