by somnial 5 days ago
True, but there's no reason the predictor model couldn't use a linear-attention-style architecture (e.g. Mamba, GDN) to predict KV caches.
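For anyone wondering why that would be attractive: a rough sketch of plain kernelized linear attention, to show the fixed-size state that replaces a growing KV cache. Assumptions here are mine, not from the thread: the `elu(x) + 1` feature map, the function names, and the toy inputs are all illustrative. Mamba and GDN use different, gated state updates, so treat this as the simplest member of the family rather than what any particular predictor would run.

```python
import math

def phi(x):
    # Assumed feature map: elu(x) + 1, which keeps every feature positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention_recurrent(qs, ks, vs):
    """Causal linear attention with a state whose size does not grow with length."""
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) * v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        fq, fk = phi(q), phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        denom = sum(fq[i] * z[i] for i in range(d_k))
        outputs.append(
            [sum(fq[i] * S[i][j] for i in range(d_k)) / denom for j in range(d_v)]
        )
    return outputs

def linear_attention_quadratic(qs, ks, vs):
    """Same result computed the slow way, revisiting every past token."""
    d_v = len(vs[0])
    outputs = []
    for t, q in enumerate(qs):
        fq = phi(q)
        weights = [
            sum(a * b for a, b in zip(fq, phi(ks[s]))) for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * vs[s][j] for s, w in enumerate(weights)) / total for j in range(d_v)]
        )
    return outputs

# Toy inputs (hypothetical values, three tokens, d_k = d_v = 2).
qs = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.9]]
ks = [[0.2, 0.1], [-0.3, 0.7], [0.6, -0.5]]
vs = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
fast = linear_attention_recurrent(qs, ks, vs)
slow = linear_attention_quadratic(qs, ks, vs)
```

The recurrent version only ever holds `S` and `z`, so memory per step is constant no matter how long the sequence gets, whereas the quadratic version needs all past keys and values.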