Hacker News
by wruza 436 days ago
Is there a well-known benchmark for this? I don't feel that short vs. long answers make any difference, but of course feelings aren't something we can measure.

Also, if that works, why don't Copilot/Cursor write lots of excess code mixed with lots of prose, only to distill it later?

1 comment

> don't feel that short vs long answers make any difference

The “thinking” models are really just verbose-output models that summarise their own thinking at the end. They tend to outperform non-thinking models, but at a higher cost, because they generate many more tokens per answer.

Anthropic lets you see some/all of the thinking so you can see how the model arrived at the answer.
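
For what it's worth, here is a rough sketch of the "be verbose first, summarise at the end" idea as two plain prompts. It is only an approximation: real thinking models do this inside a single generation, not as two separate calls. The names `reason_then_distill`, `generate` and `fake_generate` are made up for illustration, and `generate` stands in for whatever model call you actually use.

```python
from typing import Callable, Dict


def reason_then_distill(question: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Two-pass sketch: long-form working first, short final answer second.

    `generate` is a hypothetical stand-in for a real model call
    (prompt in, text out). It is not any particular vendor's API.
    """
    # Pass 1: ask for verbose working. This is the expensive part,
    # since every token of the working has to be generated.
    reasoning = generate(
        "Think through the following step by step, in as much detail as you need.\n\n"
        f"Question: {question}"
    )
    # Pass 2: condense the working into a short answer.
    answer = generate(
        "Below is a question and some working notes. "
        "Reply with only the final answer, in one or two sentences.\n\n"
        f"Question: {question}\n\nNotes:\n{reasoning}"
    )
    return {"reasoning": reasoning, "answer": answer}


def fake_generate(prompt: str) -> str:
    """Canned responses so the sketch runs offline; swap in a real model call."""
    if prompt.startswith("Think through"):
        return "17 * 3 = 51; 51 + 4 = 55."
    return "55"


result = reason_then_distill("What is 17 * 3 + 4?", fake_generate)
print(result["answer"])
```

Keeping `reasoning` in the returned dict mirrors being able to inspect how the model arrived at the answer, while `answer` is the distilled part you would normally show.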

So if I replace "answer" with "summarize", that should work then?