by dezgeg
74 days ago
As I understand it, more recent models need extra compute to produce a human-readable version of the thinking tokens, so showing them does affect latency. The more important motive, though, is probably that skipping this step lets the vendor squeeze in more concurrent users.
What Anthropic is doing is still generating the thinking tokens (because they improve answer quality) without showing them to users. I believe this may hint at a future where LLM vendors no longer want to show the internal reasoning the way they do right now.
I’m very much of the opinion that hiding them from the response because it “improves latency” is nonsense.