HN Mirror

by sebzim4500 1135 days ago

>Because we don't think one word at a time

In what sense does an LLM think one word at a time that doesn't also apply to a person typing at a keyboard? I'm typing one word at a time right now, and I assume you aren't about to declare me a Markov chain. When I read, my brain presumably ingests one word at a time (not sure if it's exactly one, but it can't be much more than one). It is of course true that I have some notion of what I'm going to say before I write the first word, but seemingly so does an LLM.

If it was truly thinking one word at a time, it wouldn't be able to consistently use 'an' vs 'a' correctly, for example.

>we don't restart from scratch for every subsequent word.

LLMs don't restart from scratch for every word: via the attention heads they can look back through the entire context. Otherwise the memory required for inference wouldn't scale with the context length.
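To make the memory point concrete, here is a minimal sketch of single-head attention with a key/value cache during decoding. It is a toy under stated assumptions, not how any particular model is implemented: `attend` and `decode` are hypothetical names, and the query/key/value vectors are random stand-ins for what a real model would compute from learned weights. The only thing it is meant to show is that each new token attends over everything cached so far, so the cache grows linearly with the context.

```python
import math
import random


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all cached positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def decode(num_steps, dim=4, seed=0):
    """Toy decoding loop; returns the number of cached floats after each step."""
    rng = random.Random(seed)
    keys, values = [], []  # the key/value cache, one entry per past token
    cache_sizes = []
    for _ in range(num_steps):
        # Assumption: random vectors stand in for the new token's projections.
        q = [rng.gauss(0, 1) for _ in range(dim)]
        k = [rng.gauss(0, 1) for _ in range(dim)]
        v = [rng.gauss(0, 1) for _ in range(dim)]
        keys.append(k)
        values.append(v)
        attend(q, keys, values)  # the new token looks back over the whole context
        cache_sizes.append(2 * dim * len(keys))  # keys plus values
    return cache_sizes


print(decode(5))  # [8, 16, 24, 32, 40]
```

Nothing is recomputed from scratch per token here; the earlier keys and values are simply kept around, which is why the cache size climbs by a fixed amount at every step.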

hospitalhusband 1135 days ago

> In what sense does an LLM think one word at a time that doesn't also apply to a person typing at a keyboard?

Because you already have the thought formed before you started typing.

> When I read my brain presumably ingests one word at a time (not sure if it's one exactly, but it can't be much more than one)

And these models ingest many vectors at once, up to the context length. Your brain is also recursive, and regularly goes backwards to rescan earlier words as necessary.

Seems to me it's fundamentally inverted from how we operate, both input and output.

sebzim4500 1135 days ago

>Because you already have the thought formed before you started typing.

Can you prove that GPT-4 doesn't? Clearly there is a sense in which it thinks more than one word ahead, since, as I mentioned above, it would not otherwise be able to use 'a' vs 'an' correctly.

As far as I am aware, exactly to what extent these models have determined what tokens will be generated before they produce anything is an open question in mechanistic interpretability research. I would be very interested if you knew of some work that answers this question empirically.
