by pk-protect-ai
916 days ago
Is there a discussion thread for the original paper? I seem to have missed it. It's quite interesting. One part I don't follow: "We note that full results on context length 8k are missing for the RWKV and RetNet baselines, prior strong recurrent models that can also be interpreted as SSMs, due to a lack of efficient implementation leading to out-of-memory or unrealistic computation requirements." RetNet doesn't really consume much memory, and its chunkwise forward implementation keeps VRAM usage bounded by the chunk size. That is exactly the mode you would use to test long context lengths. Has anyone run tests on the original Mamba model? How fast is its training compared with RetNet in parallel forward mode?
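
A rough NumPy sketch of the chunkwise idea mentioned above, assuming single-head retention with a scalar decay `gamma` and leaving out RetNet's scaling, group norm, and position rotation. The function names are made up for illustration and are not taken from any RetNet codebase. The point is only that the chunkwise form builds a `chunk x chunk` score matrix plus a fixed-size state, rather than the full `n x n` matrix of the parallel form:

```python
import numpy as np

def retention_parallel(q, k, v, gamma):
    # Parallel form: materializes the full n x n decay-masked score matrix.
    n = q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    decay = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return (q @ k.T * decay) @ v

def retention_chunkwise(q, k, v, gamma, chunk):
    # Chunkwise form: only a (chunk x chunk) score matrix and a
    # (d_k x d_v) recurrent state are alive at any one time.
    n, d_v = v.shape
    state = np.zeros((q.shape[1], d_v))
    out = np.empty((n, d_v))
    for start in range(0, n, chunk):
        qc = q[start:start + chunk]
        kc = k[start:start + chunk]
        vc = v[start:start + chunk]
        b = qc.shape[0]  # the last chunk may be shorter
        j = np.arange(b)
        diff = j[:, None] - j[None, :]
        decay = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
        inner = (qc @ kc.T * decay) @ vc                    # within-chunk part
        cross = (qc @ state) * (gamma ** (j + 1))[:, None]  # carried-over part
        out[start:start + b] = inner + cross
        # Decay the old state across the whole chunk, then add this chunk's keys/values.
        state = gamma ** b * state + kc.T @ (vc * (gamma ** (b - 1 - j))[:, None])
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 4)) for _ in range(3))
same = np.allclose(retention_parallel(q, k, v, 0.9),
                   retention_chunkwise(q, k, v, 0.9, chunk=4))
```

Both forms should produce the same outputs up to floating-point error, which is what `same` checks. This says nothing about actual training speed or about memory for stored activations, which is what a real benchmark against Mamba would need to measure.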
https://openreview.net/forum?id=AL1fq05o7H