by d3m0t3p
577 days ago
Increasing the dimension makes the computation a lot more expensive; that is one of the main reasons. You can see evidence of this in multi-head attention, where the dimension of each head is reduced via a linear projection:

h_i = attention(W_i^Q Q, W_i^K K, W_i^V V)
h = W_o @ concat(h_1, ..., h_8)
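To make the shapes concrete, here is a rough pure-Python sketch of the projection described above. It is only an illustration under assumptions that are mine, not the commenter's: the sizes (d_model = 16, 8 heads, 4 tokens), the random weights, the usual 1/sqrt(d_k) scaling, and all helper names are hypothetical. It also uses the row-vector convention (x @ W) instead of the W @ x form written in the comment.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng):
    """Random weights; purely illustrative, nothing is trained here."""
    return [[rng.gauss(0, 1) / math.sqrt(rows) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, n_heads, rng):
    """Self-attention where each head works in d_k = d_model // n_heads dims."""
    n, d_model = len(x), len(x[0])
    d_k = d_model // n_heads  # the reduced per-head dimension
    heads = []
    for _ in range(n_heads):
        # Linear projections from d_model down to d_k for this head.
        w_q = rand_matrix(d_model, d_k, rng)
        w_k = rand_matrix(d_model, d_k, rng)
        w_v = rand_matrix(d_model, d_k, rng)
        q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        # Scaled dot-product scores: an n x n matrix built from d_k-dim vectors.
        scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
        heads.append(matmul(softmax_rows(scores), v))  # n x d_k
    # Concatenate the heads back to n x d_model, then apply the output projection.
    concat = [sum((h[i] for h in heads), []) for i in range(n)]
    w_o = rand_matrix(d_model, d_model, rng)
    return matmul(concat, w_o), d_k


rng = random.Random(0)
x = rand_matrix(4, 16, rng)  # 4 tokens, d_model = 16
out, d_k = multi_head_attention(x, n_heads=8, rng=rng)
print(d_k, len(out), len(out[0]))
```

With these sizes each head computes its scores from 2-dimensional queries and keys, so the 8 heads together cost roughly as much as a single head working in all 16 dimensions.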
Seems to me like it’d be quite a low number compared to the dimensionality of the semantic vectors?