Hacker News new | ask | show | jobs
by humbleharbinger 1120 days ago
Not sure if it answers your question but the paper notes something similar in its discussion of limitations:

> the linear attention of RWKV leads to significant efficiency gains but still, it may also limit the model’s performance on tasks that require recalling minutiae information over very long contexts. This is due to the funneling of information through a single vector representation over many time steps, compared with the full information maintained by the quadratic attention of standard Transformers. In other words, the model’s recurrent architecture inherently limits its ability to “look back” at previous tokens, as opposed to traditional self-attention mechanisms. While learned time decay helps prevent the loss of information, it is mechanistically limited compared to full self- attention.

> Another limitation of this work is the increased importance of prompt engineering in comparison to standard Transformer models. The linear attention mechanism used in RWKV limits the information from the prompt that will be carried over to the model’s continuation. As a result, carefully designed prompts may be even more crucial for the model to perform well on tasks

1 comments

Prompt design is definitely a huge one shifting to rwkv

Sadly too many folks copy and paste what works for openAI and move on when it fails