by Wehrdo 1055 days ago
Here is an article that explains more about the outliers that emerge in large transformer models, which are what this modified softmax is proposed to fix:

https://timdettmers.com/2022/08/17/llm-int8-and-emergent-fea...

The fact that these only emerge in larger models is likely one reason the author hasn't actually tried it.