by wruza
436 days ago
Is there a well-known benchmark for this? I don't feel that short vs. long answers make any difference, but of course feelings aren't something we can measure. Also, if that works, why don't Copilot/Cursor write lots of excessive code mixed with lots of prose, only to distill it later?
The “thinking” models are really verbose-output models that summarise their thinking at the end. They tend to outperform non-thinking models, but at a higher cost.
Anthropic lets you see some or all of the thinking, so you can see how the model arrived at the answer.