| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by parrit 455 days ago
	So if you now hide my original comment and try to recall what I said, do you know it word for word (and are thinking if every word, e.g. did I use one or 2 spaces somewhere as that would change tokens) or do you have a rough concept of what I said? OTOH if you had to remember a phone number to write it down, how does that differ?

1 comments

boroboro4 455 days ago

I think in a way it makes transformers superior to humans, their short term memory is much more powerful =) Supporting extra long contexts also make transformers super human. Because, again, human's short term memory is exactly this - short term. And much shorter than millions of tokens we expect from models nowadays.

As for SSMs - I think they compress model memory state way too much. Mixed global/local attention layers do just as well. And sparse/block attention seems like a way forward much more (https://arxiv.org/abs/2502.11089).

link

littlestymaar 455 days ago

> And much shorter than millions of tokens we expect from models nowadays.

Yet all current model still suck above 32k. (Yes some can do needle in a haystack fine, but they still fail at anything even slightly more complex over a long context).

32k is still much higher than humans' though, so I agree with you that it gives them some kind of super human abilities over moderately long context, but they are still disappointingly bad over longer context.

link

boroboro4 455 days ago

Out of curiosity I estimated per day context size (of text only!) by multiplying reading speed by number of minutes: 16 * 60 * 300 = 288000 words ~ 288000 tokens.

link