Hacker News new | ask | show | jobs
by stingraycharles 59 days ago
While the caveman stuff is obviously not serious, there is a lot of legit research in this area.

Which means yes, you can actually influence this quite a bit. Read the paper “Compressed Chain of Thought” for example, it shows it’s really easy to make significant reductions in reasoning tokens without affecting output quality.

There is not too much research into this (about 5 papers in total), but with that it’s possible to reduce output tokens by about 60%. Given that output is an incredibly significant part of the total costs, this is important.

https://arxiv.org/abs/2412.13171

3 comments

Who would suspect that the companies selling 'tokens' would (unintentionally) train their models to prefer longer answers, reaping a HIGHER ROI (the thing a publicly traded company is legally required to pursue: good thing these are all still private...)... because it's not like private companies want to make money...
I don’t think this is a plausible argument, as they’re generally capacity constrained, and everyone would like shorter (= faster) responses.

I’m fairly certain that in a few more releases we’ll have models with shorter CoT chains. Whether they’ll still let us see those is another question, as it seems like Anthropic wants to start hiding their CoT, potentially because it reveals some secret sauce.

I guess mainly they don’t want you to distill on their CoT
Yes, which I understand, but I think they’re crippling their product for users this way.

I don’t think it’s just this, because the thinking tokens often reveal more about Anthropic’s inner workings. For example, it’s how the whole existence of Claude’s soul document was reverse engineered, it often leaks details about “system reminders” (eg long conversation reminders).

I think it’s also just very convenient for Anthropic to do this. The fact that they’re also presenting this as a “performance optimization” suggests they’re not giving the real reason they do this.

Try setting up one laundry which charges by the hour and washes clothes really really slowly, and another which washes clothes at normal speed at cost plus some margin similar to your competitors.

The one which maximizes ROI will not be the one you rigged to cost more and take longer.

I don't think the analogy is correct here.

Directionally, tokens are not equivalent to "time spent processing your query", but rather a measure of effort/resource expended to process your query.

So a more germane analogy would be:

What if you set up a laundry which charges you based on the amount of laundry detergent used to clean your clothes?

Sounds fair.

But then, what if the top engineers at the laundry offered an "auto-dispenser" that uses extremely advanced algorithms to apply just the right optimal amount of detergent for each wash?

Sounds like value-added for the customer.

... but now you end up with a system where the laundry management team has strong incentives to influence how liberally the auto-dispenser will "spend" to give you "best results"

Shades of “repeat” in lather, rinse, repeat.
LLM APIs sell on value they deliver to the user, not the sheer number of tokens you can buy per $. The latter is roughly labor-theory-of-value levels of wrong.
Some labs do it internally because RLVR is very token-expensive. But it degrades CoT readability even more than normal RL pressure does.

It isn't free either - by default, models learn to offload some of their internal computation into the "filler" tokens. So reducing raw token count always cuts into reasoning capacity somewhat. Getting closer to "compute optimal" while reducing token use isn't an easy task.

Yeah the readability suffers, but as long as the actual output (ie the non-CoT part) stays unaffected it’s reasonably fine.

I work on a few agentic open source tools and the interesting thing is that once I implemented these things, the overall feedback was a performance improvement rather than performance reduction, as the LLM would spend much less time on generating tokens.

I didn’t implement it fully, just a few basic things like “reduce prose while thinking, don’t repeat your thoughts” etc would already yield massive improvements.

Yeah you could easily imagine stenography like inputs and outputs for rapid iteration loops. It's also true that in social media people already want faster-to-read snippets that drop grammar so the desire for density is already there for human authors/readers.