Hacker News
by killerstorm 397 days ago
Another paper related to attention distillation, although it does something far more radical: it distills transformer attention into an RWKV-like model. https://huggingface.co/papers/2505.03005