by Aedelon
111 days ago
Author here. The core claim: RWKV-7 (2.9B params, RNN) scores 72.8% avg across
standard benchmarks vs LLaMA 3.2's 69.7%, while being trained on 3.1T tokens vs ~9T.
Same parameter count, roughly one-third the data.

The more interesting result is architectural: RWKV-7 formally exceeds TC⁰,
the complexity class bounding standard Transformers (Merrill & Sabharwal's
proof in the paper). It solves state-tracking problems that fixed-depth
attention provably cannot.

Inference runs in O(1) memory per token, with no KV cache. The hybrid variant
(RWKV-X) hits 99.8% passkey retrieval at 64K and 1.37x Flash Attention v3
throughput at 128K.

Paper: https://arxiv.org/abs/2503.14456 (COLM 2025, peer-reviewed)

Weights: https://huggingface.co/collections/RWKV/rwkv-v7-67d43835efa2...

Code: https://github.com/BlinkDL/RWKV-LM (Apache 2.0)

Happy to discuss the delta rule generalization, the TC⁰ proof, or the
benchmark methodology. I went through 36 sources digging into the caveats.
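
For anyone unfamiliar with the "O(1) memory, no KV cache" point, here is a minimal sketch of the plain delta rule that RWKV-7 generalizes. To be clear, this is not the RWKV-7 update itself: the function names (`delta_rule_step`, `read`, `run`), the scalar `beta`, and the toy dimensions are all assumptions made for illustration. The thing to notice is that the state is a fixed-size matrix no matter how many tokens have been processed.

```python
import math

def normalize(x):
    """Scale a key to unit length so a beta=1 write is exact."""
    n = math.sqrt(sum(c * c for c in x))
    return [c / n for c in x]

def delta_rule_step(state, k, v, beta=1.0):
    """One recurrent update: S <- S + beta * (v - S k) k^T.

    state is a d_v x d_k matrix (list of rows); its size never grows.
    """
    # What the state currently returns for key k.
    pred = [sum(s * kj for s, kj in zip(row, k)) for row in state]
    # Correct the state toward v along the direction of k.
    return [
        [s + beta * (vi - pi) * kj for s, kj in zip(row, k)]
        for row, vi, pi in zip(state, v, pred)
    ]

def read(state, q):
    """Query the state: o = S q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in state]

def run(tokens, d_k, d_v):
    """Fold a sequence of (key, value) pairs into one fixed-size state."""
    state = [[0.0] * d_k for _ in range(d_v)]
    for k, v in tokens:
        state = delta_rule_step(state, normalize(k), v)
    return state

# Toy sequence: the third token overwrites the value stored under the first key.
tokens = [([1, 0, 0], [1, 2]), ([0, 1, 0], [3, 4]), ([1, 0, 0], [5, 6])]
state = run(tokens, d_k=3, d_v=2)
print(read(state, [1, 0, 0]))  # value most recently written under key 0
print(read(state, [0, 1, 0]))  # value written under key 1, left untouched
```

A KV cache would store all three (key, value) pairs and grow with sequence length; here the whole history lives in a 2x3 matrix, and a later write to the same key replaces the earlier value, which is a very simple form of state tracking.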
https://huggingface.co/datasets/nyuuzyou/archiveofourown/dis...