|
|
|
|
|
by bluecoconut
990 days ago
|
|
I think people are misreading this work, and assuming this is equivalent to full dense-attention. This is just saying its an efficiency gain over sliding window re-computation, where instead of computing the L^2 cost over and over (T times), you can re-use a cache and maintain perplexity. I don't think they are claiming that this allows for attending to content that was far away. They tested by running concatenating and measuring -> `Q A Q A Q A Q A...` not by doing `Q Q Q Q A A A A...` They also measure perplexity, showing that it produces "readable text" (coherent, locally viable); not that it is "extracting anything" from the big-triangle-gap of no-attention. I think this would fail to be given a book, then write the first word of every paragraph. Or, given a book, write a 1 sentence summary of each chapter. I might be wrong, because they didn't test tasks like this, but I'd be very very surprised. |
|
Just tested it - this definitely doesn't seem to be giving enhanced context length. It does run quickly though, can confirm it was using about 35 GB of an A100 RAM and pinned the usage for the entire duration.
I ran through by getting a book from project gutenberg, splitting it into paragraphs, and feeding them in paragraph by paragraph (asking it to say "okay" each paragraph), then at the end, asked some questions. It entirely hallucinated its answers. (also note: in the ~10 min of playing with this, i couldn't get the base model (lmsys/vicuna-13b-v1.3) to respond in english...)
https://gist.github.com/bluecoconut/9cae9e91fe3b1616ed650a96...