by WithinReason
816 days ago
That's what I thought at first, but it doesn't actually make sense: because of the mask used in attention, the amount of work done on a string is the same whether or not it is followed by padding. Then I realised that an LLM's working memory is limited to its activations, which can be a constraint. But it can extend its working memory by writing partial results to the output and reading them back in. E.g. if you tell it to "think of a number" without telling you what it is, it can't do that: there is nowhere to store the number, since it has no temporary storage other than the tape. But if you ask it to "think step by step", you let it store intermediate results (thoughts) on the tape, giving it extra storage it can use for thinking.
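To make the "tape as extra storage" idea concrete, here is a toy sketch in Python. It is not how a real LLM works: `step`, `with_scratchpad` and `without_scratchpad` are made-up names, and the "one addition per call" limit is just an assumed stand-in for the fixed amount of work a single forward pass can do. The local variables inside `step` play the role of activations and disappear when the call returns.

```python
def step(prompt, scratch):
    """Toy stand-in for one forward pass (hypothetical, for illustration).

    The result depends only on what is on the tape (prompt + scratch),
    and at most one addition is performed per call.
    """
    consumed = len(scratch)                   # partial sums already on the tape
    if consumed == len(prompt):
        return None                           # nothing left to add
    running = scratch[-1] if scratch else 0   # read the last partial result back in
    return running + prompt[consumed]


def with_scratchpad(prompt):
    """'Think step by step': every output is appended to the tape."""
    scratch = []
    while (out := step(prompt, scratch)) is not None:
        scratch.append(out)                   # write the partial result to the tape
    return scratch


def without_scratchpad(prompt, calls):
    """Outputs are thrown away, so every call sees the same empty scratch."""
    return [step(prompt, []) for _ in range(calls)]


print(with_scratchpad([3, 4, 5]))        # partial sums accumulate on the tape
print(without_scratchpad([3, 4, 5], 3))  # never gets past the first number
```

With the scratchpad the partial sums build up to the full total; without it, no number of extra calls helps, because nothing computed in one call is visible to the next.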