Hacker News
by mathis 328 days ago
This might be purer, but there is nothing to be gained. On the contrary, it would lead to very long sequences, and self-attention scales poorly with length: in the standard dense form, cost grows quadratically with the number of positions.
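To put a rough number on the scaling point, here is a minimal sketch. It assumes standard dense self-attention, where each head builds a seq_len x seq_len score matrix. The function name and the two sequence lengths are hypothetical, picked only for illustration.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one head's dense attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


# Hypothetical lengths: the same input, split into 4x as many units.
short_len = 1_000
long_len = 4_000

ratio = attention_score_entries(long_len) // attention_score_entries(short_len)
print(ratio)  # 16
```

So under that assumption, a 4x longer sequence means roughly 16x the attention work per layer, which is the cost being objected to here.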