by smaddox
752 days ago
The first two changes appear theoretically sound, but it's not clear that they would yield an actual performance improvement at scale. Their analysis ignores that a single matrix multiplication is typically used to calculate the Q, K, and V values from the inputs. The third change looks like it would break causal masking for autoregressive language models. For masked-token language models and ViTs, though, perhaps it's an improvement.
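To make the fused-projection point concrete, here is a rough NumPy sketch. It assumes a single head, no biases, and toy shapes; the names (`w_q`, `w_k`, `w_v`, `w_qkv`) are made up for illustration and this is not the paper's code. It shows that projecting to Q, K, and V with one concatenated weight matrix gives the same result as three separate multiplications, then applies the usual lower-triangular causal mask that an autoregressive model relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # toy sizes, chosen arbitrarily

x = rng.standard_normal((seq_len, d_model))
w_q = rng.standard_normal((d_model, d_model))
w_k = rng.standard_normal((d_model, d_model))
w_v = rng.standard_normal((d_model, d_model))

# Three separate projections.
q_sep, k_sep, v_sep = x @ w_q, x @ w_k, x @ w_v

# One fused projection: concatenate the weights along the output axis,
# multiply once, then split the result back into Q, K, and V.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=1)
q, k, v = np.split(x @ w_qkv, 3, axis=1)

fused_matches = (
    np.allclose(q, q_sep) and np.allclose(k, k_sep) and np.allclose(v, v_sep)
)

# Causal masking: position i may only attend to positions j <= i.
scores = q @ k.T / np.sqrt(d_model)
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(causal, scores, -np.inf)

# Row-wise softmax; masked entries get exactly zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v
```

Since the fused version is already one matmul, a change that only saves work in the separate-projection picture may not save much in practice. Likewise, any change that lets information cross the upper triangle of `weights` would leak future tokens to an autoregressive model.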