| HN Mirror

	by killerstorm 444 days ago
	Another paper related to attention distillation, although it does something far more radical: transformer attention is distilled into an RWKV-like model. See https://huggingface.co/papers/2505.03005