by innerlee
312 days ago
The singular defects (or high-norm tokens) [1] may be related to attention sinks. It is interesting that all high-norm tokens point in the same direction. Maybe the theory behind this is not very complex, and the issue can be fixed cleverly during training.

[1] https://openreview.net/pdf?id=4yBnUokU2v