by d3m0t3p 577 days ago
Increasing the dimension adds a lot of computation; that is one of the main reasons. You can see evidence of this in multi-head attention, where each head's dimension is reduced via a linear projection.

h_i = attention(Q W_i^Q, K W_i^K, V W_i^V)

h = concat(h_1, ..., h_8) W_o
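
To put rough numbers on the cost argument, here is a back-of-the-envelope sketch that counts multiply-accumulates for one multi-head attention layer. It assumes each head has size d_model / n_heads (as in the original Transformer), so the totals do not depend on the head count. The function names, the sequence length of 128, and the 64 extra dimensions for a concatenated positional vector are all hypothetical, picked only for illustration.

```python
def mha_projection_macs(seq_len: int, d_model: int) -> int:
    """Multiply-accumulates for the Q, K, V and output projections.

    Assumption: the heads together span d_model, so each of the four
    projections costs seq_len * d_model * d_model in total.
    """
    return 4 * seq_len * d_model * d_model


def mha_attention_macs(seq_len: int, d_model: int) -> int:
    """Multiply-accumulates for the score matrix and the weighted sum over V.

    Summed over all heads, each of the two steps costs
    seq_len * seq_len * d_model.
    """
    return 2 * seq_len * seq_len * d_model


def mha_macs(seq_len: int, d_model: int) -> int:
    """Total multiply-accumulates for one multi-head attention layer."""
    return mha_projection_macs(seq_len, d_model) + mha_attention_macs(seq_len, d_model)


if __name__ == "__main__":
    seq_len = 128    # hypothetical sequence length
    d_model = 512    # model width used in the original Transformer
    d_extra = 64     # hypothetical extra dims for a concatenated position vector

    base = mha_macs(seq_len, d_model)
    wider = mha_macs(seq_len, d_model + d_extra)
    print(f"base:  {base} MACs")
    print(f"wider: {wider} MACs ({wider / base:.3f}x)")
```

The projection term grows with the square of the width while the attention term grows linearly, so even a modest widening is paid for in every layer. This only counts the attention block; the feed-forward layers would scale with width as well.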

1 comments

How many dimensions would you need to add to capture positional information?

Seems to me like it'd be quite a low number compared to the dimensionality of the semantic vectors?