| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by d3m0t3p 577 days ago

Increasing the dimension causes a lot more computation, this is one of the main reason. You can see evidence of this in the multi head where the dim is reduced via a linear projection.

h_i = attention(W_i^Q Q^T @ W_i^K K) W_i^v V

h = W_o @ concat(h_1...h_8)

1 comments

bfelbo 577 days ago

How many dimensions would you need to increase by to capture positional information?

Seems to me like it’d be a quite low number compared to the dimensionality of the semantic vectors?

link