|
|
|
|
|
by SilenN
145 days ago
|
|
Simply, it's when your output embedding matrix = input. You save vocab_dim*model_dim params (ex. 617m for GPT-3). But the residual stream means that the weight matrices are roughly connected via a matmul, which means they struggle to encode bigrams (commutative property enforces symmetry). Attention + MLP adds nonlinearity, but it still means less expressivity. Which is why they aren't SOTA, but are useful in smaller models. |
|