by bluecoconut 990 days ago
I think people are misreading this work and assuming it is equivalent to full dense attention. It is just saying it's an efficiency gain over sliding-window re-computation: instead of paying the L^2 cost over and over (T times), you can re-use a cache and maintain perplexity. I don't think they are claiming that this allows attending to content that was far away.
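
To put rough numbers on that efficiency claim, here is a back-of-the-envelope sketch. It assumes a window of L tokens and T generated tokens, and it only counts attention-score computations at the order-of-magnitude level (it ignores the causal mask, layers, and heads). The function names are mine, not from the paper or the repo.

```python
def recompute_cost(T: int, L: int) -> int:
    """Score computations when the whole L-token window is re-encoded
    from scratch for each of the T generated tokens: about T * L^2."""
    return T * L * L


def cached_cost(T: int, L: int) -> int:
    """Score computations when cached keys/values are re-used, so each
    new token only attends over the L cached entries: about T * L."""
    return T * L


# Illustrative sizes only (assumed, not taken from the paper).
T, L = 10_000, 1_024
speedup = recompute_cost(T, L) // cached_cost(T, L)  # equals L in this model
```

In this simplified model the saving is a factor of L, which is a speed claim and says nothing about how far back the model can see.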

They tested by concatenating and measuring -> `Q A Q A Q A Q A...`, not by doing `Q Q Q Q A A A A...`
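
A small sketch of why the ordering matters, assuming each question and answer is a single item in the stream. The helper names and the `Q1`/`A1` labels are made up for illustration. In the interleaved layout every answer sits right next to its question, so a short window is enough; in the batched layout the gap grows with the number of questions.

```python
def interleaved(questions, answers):
    """Layout Q A Q A ...: each answer directly follows its question."""
    out = []
    for q, a in zip(questions, answers):
        out += [q, a]
    return out


def batched(questions, answers):
    """Layout Q Q ... A A ...: all questions first, then all answers."""
    return list(questions) + list(answers)


def max_gap(sequence, questions, answers):
    """Largest distance (in items) between a question and its answer."""
    return max(sequence.index(a) - sequence.index(q)
               for q, a in zip(questions, answers))


qs = ["Q1", "Q2", "Q3", "Q4"]
ans = ["A1", "A2", "A3", "A4"]
gap_interleaved = max_gap(interleaved(qs, ans), qs, ans)  # stays at 1
gap_batched = max_gap(batched(qs, ans), qs, ans)          # grows with len(qs)
```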

They also measure perplexity, showing that it produces "readable text" (coherent, locally viable); not that it is "extracting anything" from the big-triangle-gap of no-attention.

I think this would fail if it were given a book and asked to write the first word of every paragraph, or given a book and asked to write a one-sentence summary of each chapter. I might be wrong, because they didn't test tasks like this, but I'd be very, very surprised.

EDIT: the authors have updated the readme to add a clarified FAQ section that directly addresses this: https://github.com/mit-han-lab/streaming-llm#faq

Just tested it: this definitely doesn't seem to give enhanced context length. It does run quickly, though; I can confirm it used about 35 GB of an A100's RAM and kept usage pinned for the entire duration.

I tested it by getting a book from Project Gutenberg, splitting it into paragraphs, and feeding them in paragraph by paragraph (asking it to say "okay" after each paragraph), then asking some questions at the end. It entirely hallucinated its answers. (Also note: in the ~10 min of playing with this, I couldn't get the base model (lmsys/vicuna-13b-v1.3) to respond in English...)

https://gist.github.com/bluecoconut/9cae9e91fe3b1616ed650a96...

Correct, but to be fair to readers (like me) the use of the term "infinite-length inputs" is misleading.

Still, really interesting work. The most salient bit is the discovery shown in Figure 2, summarized as:

> (1) The attention maps in the first two layers (layers 0 and 1) exhibit the "local" pattern, with recent tokens receiving more attention. (2) Beyond the bottom two layers, the model heavily attends to the initial token across all layers and heads.

> surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task, as visualized in Figure 2. We term these tokens "attention sinks". Despite their lack of semantic significance, they collect significant attention scores. We attribute the reason to the Softmax operation, which requires attention scores to sum up to one for all contextual tokens. Thus, even when the current query does not have a strong match in many previous tokens, the model still needs to allocate these unneeded attention values somewhere so it sums up to one. The reason behind initial tokens as sink tokens is intuitive: initial tokens are visible to almost all subsequent tokens because of the autoregressive language modeling nature, making them more readily trained to serve as attention sinks.
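
The "must sum to one" point in that quote is easy to see with a toy example. This is a minimal sketch in plain Python, with made-up scores (not taken from the paper): even when every token is an equally poor match, softmax still hands out a full unit of attention, and a single token with a higher score soaks up almost all of it.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Eight context tokens, none of which matches the query well (assumed scores).
no_match = [-5.0] * 8
weights_no_sink = softmax(no_match)      # uniform, but still sums to 1

# Same tokens, plus one initial token with a higher score acting as a sink.
weights_with_sink = softmax([2.0] + no_match)
sink_share = weights_with_sink[0]        # the sink takes nearly all the mass
```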

StreamingLLM is basically a "hack" that fixes this odd behavior when we go around butchering the LLM's attention window.
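
As I understand the approach, the "hack" amounts to keeping the first few tokens (the sinks) in the KV cache alongside a rolling window of recent tokens, and evicting everything in between. Below is a minimal sketch of that eviction policy over token positions only. The function and parameter names (`keep_positions`, `n_sink`, `window`) are hypothetical, this is not the repo's code, and it ignores details such as how positions are encoded inside the cache.

```python
def keep_positions(positions, n_sink, window):
    """Return the cache positions that survive eviction: the first
    `n_sink` entries plus the most recent `window` entries."""
    positions = list(positions)
    if len(positions) <= n_sink + window:
        return positions  # cache not full yet, nothing to evict
    recent = positions[len(positions) - window:]
    return positions[:n_sink] + recent


# 20 tokens seen so far, with assumed sizes of 4 sink tokens and a window of 6.
kept = keep_positions(range(20), n_sink=4, window=6)
evicted = [p for p in range(20) if p not in kept]
```

Everything in `evicted` is simply gone from the cache, which is why this keeps generation stable without letting the model attend to the middle of a long input.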

This actually isn't the first time cracks have been shown in the usage of softmax and it makes me wonder if a different function might be better if we want context-length flexible LLMs.