by sojuz151 758 days ago
>My intuition is that as contexts get longer we start hitting the limits of how much comprehension can be embedded in a single point of vector space, and will need better architectures for selecting the relevant portions of the context.

We are dealing with multi-head attention, so each token gets multiple points, one per head. You can always increase the number of heads or the size of the key vector.
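To make the "multiple points per token" idea concrete, here is a rough NumPy sketch of the key projection alone. The function name `project_keys`, the random weights standing in for learned ones, and the sizes used are all assumptions for illustration, not taken from any particular model.

```python
import numpy as np

def project_keys(x, num_heads, d_k, seed=0):
    """Project token embeddings into one key vector per head.

    x has shape (seq_len, d_model); the result has shape
    (num_heads, seq_len, d_k).
    """
    rng = np.random.default_rng(seed)
    d_model = x.shape[1]
    # One (d_model, d_k) projection per head. Random values are
    # stand-ins for learned weights (assumption for this sketch).
    w_k = rng.standard_normal((num_heads, d_model, d_k)) / np.sqrt(d_model)
    # s = token position, d = embedding dim, h = head, k = key dim
    return np.einsum("sd,hdk->hsk", x, w_k)

# 5 tokens with 16-dimensional embeddings (arbitrary sizes)
x = np.random.default_rng(1).standard_normal((5, 16))
keys = project_keys(x, num_heads=4, d_k=8)
print(keys.shape)  # (4, 5, 8): 4 key vectors of size 8 for each of 5 tokens
```

Both `num_heads` and `d_k` are free hyperparameters here, so either can be raised to give each token more, or larger, points.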

1 comment

The token embedding is what ultimately gets nudged around by the heads though, right? The key vector just relates to the context size, not the token embedding size, afaik.
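For reference, a minimal sketch of one attention layer in the textbook formulation, as I understand it: the heads' outputs are concatenated, projected back to the embedding size, and added to the token embedding. The name `attention_layer`, the random weights, and the sizes are assumptions for illustration, and there is no causal mask. In this sketch the key width `d_k` is a fixed hyperparameter; what grows with the context is the number of key vectors and the score matrix.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(x, num_heads, d_k, seed=0):
    """Unmasked multi-head self-attention with a residual update.

    x: (seq_len, d_model). Returns (updated x, per-head keys).
    Weights are random stand-ins for learned ones (assumption).
    """
    rng = np.random.default_rng(seed)
    seq_len, d_model = x.shape
    scale = 1.0 / np.sqrt(d_model)
    w_q = rng.standard_normal((num_heads, d_model, d_k)) * scale
    w_k = rng.standard_normal((num_heads, d_model, d_k)) * scale
    w_v = rng.standard_normal((num_heads, d_model, d_k)) * scale
    w_o = rng.standard_normal((num_heads * d_k, d_model)) * scale

    q = np.einsum("sd,hdk->hsk", x, w_q)
    k = np.einsum("sd,hdk->hsk", x, w_k)
    v = np.einsum("sd,hdk->hsk", x, w_v)

    # (heads, seq_len, seq_len): this is the part that scales with context.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    heads = softmax(scores) @ v  # (heads, seq_len, d_k)

    # Concatenate heads per token, project back to d_model, add to x.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, num_heads * d_k)
    return x + concat @ w_o, k

for seq_len in (4, 32):
    x = np.random.default_rng(1).standard_normal((seq_len, 16))
    out, k = attention_layer(x, num_heads=4, d_k=8)
    print(seq_len, out.shape, k.shape)
```

The updated embedding keeps the shape `(seq_len, 16)` for both context lengths, and each key vector stays 8 wide; only the count of keys changes.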