Hacker News
by killerstorm 397 days ago
Another paper on attention distillation, though it does something far more radical, distilling transformer attention into an RWKV-like model:
https://huggingface.co/papers/2505.03005