
by orbital-decay 508 days ago

That's because it uses a long CoT. The actual paper [1] [2] talks about the limitations of decoder-only transformers predicting the reply directly, although it also establishes the benefits of CoT for composition.

This has all been known for a long time and makes intuitive sense - you can't squeeze more computation out of it than it can provide. The authors just formally proved it (which is no small feat). And Quanta is being dramatic with its conclusions and headlines, as always.

[1] https://arxiv.org/abs/2412.02975

[2] https://news.ycombinator.com/item?id=42889786

5 comments

antirez 508 days ago

LLMs using CoT are also decoder-only. It's not a paradigm shift, as people now want to claim to avoid admitting they were wrong: it's still next-token prediction, just forced to explore more of the possibilities in the space the model contains. And with R1-Zero we also know that LLMs can train themselves to do so.

janalsncm 508 days ago

That’s a different paper than the one this article describes. The article describes this paper: https://arxiv.org/abs/2305.18654

mkl 507 days ago

The article describes both papers.

usaar333 507 days ago

A paper that came out 15 months ago?

mkl 507 days ago

Yes! That one's linked in paragraph three.

teruakohatu 508 days ago

gpt-4o, asked to produce SWI-Prolog code, gets the same result using very similar code. gpt-4-turbo can do it with slightly less nice code. gpt-3.5-turbo struggled to get the syntax correct, but I think it could manage with some better prompting.

CoT is definitely optional, although I am sure all LLMs have seen this problem explained and solved in their training data.

mycall 508 days ago

This doesn't cover encoder-decoder transformers, as used for machine translation, or encoder-only models such as BERT, as used for text classification and named entity recognition.

leonidasv 508 days ago

Also, notice that the original study is from 2023.
