by alevskaya
1057 days ago
Yeah we used to use this in our older models years ago... I don't recall the details exactly, but I don't think it ever did very much. I certainly don't think it will help at all with stability. Things like Q/K layernorm are better tricks for softmax stability when scaling: https://arxiv.org/pdf/2302.05442.pdf
How would you have known whether the trick actually reduces the outliers in the weights? Even if the transformer quality does not improve overall, having fewer outliers as a result is very beneficial for more accurate quantization of the data.
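
For what it's worth, here is a rough sketch of one way you could check this, not what anyone in this thread actually ran. It assumes you can dump a layer's weights as a flat list of floats, and it uses two stand-in metrics: excess kurtosis as an outlier score, and the round-trip error of symmetric absmax int8 quantization. The helper names and the synthetic "weights" are hypothetical, made up for illustration.

```python
import random
import statistics


def excess_kurtosis(xs):
    """Excess kurtosis: near 0 for Gaussian data, large when a few values dominate."""
    mean = statistics.fmean(xs)
    var = statistics.fmean([(x - mean) ** 2 for x in xs])
    fourth = statistics.fmean([(x - mean) ** 4 for x in xs])
    return fourth / var**2 - 3.0


def int8_roundtrip_error(xs):
    """Mean absolute error after symmetric absmax int8 quantize/dequantize."""
    scale = max(abs(x) for x in xs) / 127.0  # the largest magnitude maps to 127
    dequant = [round(x / scale) * scale for x in xs]
    return statistics.fmean([abs(x - q) for x, q in zip(xs, dequant)])


# Synthetic stand-in for a weight tensor (assumption: not real model weights).
rng = random.Random(0)
clean = [rng.gauss(0.0, 0.02) for _ in range(4096)]
spiky = clean[:-1] + [2.0]  # same weights with one injected outlier

for name, w in (("clean", clean), ("spiky", spiky)):
    print(name, excess_kurtosis(w), int8_roundtrip_error(w))
```

A single large value stretches the absmax scale, so every other weight gets fewer effective quantization levels. Comparing these two numbers per layer, with and without the trick, would show whether outliers went down even when the eval loss barely moves.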