by espadrine
984 days ago
|
|
Token embeddings typically attend only to past token embeddings, not future ones. The reason is to enable significant parallelism during training: a large chunk of text goes through the transformer in a single pass, and its weights are optimized to make its output look like its input shifted by one token (i.e. the transformer converts each input token into a prediction of the next token). However, if attention could see future tokens, the model would simply copy the next token it is given in order to predict that same token, which teaches it nothing useful for generation, where future tokens are not available. So all future tokens are masked out.
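
To make the masking concrete, here is a rough sketch in plain Python, not how any particular library implements it. The function names `causal_attention_weights` and `shifted_targets` and the uniform example scores are made up for illustration. It masks the scores for future positions before the softmax, and shows the "input shifted by one token" pairing used as the training target.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the raw attention score of token i toward token j.
    Returns softmax-normalized weights with future positions (j > i) masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Set scores for future tokens to -inf so the softmax gives them weight 0.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def shifted_targets(tokens):
    """Pair each input token with the next token it is trained to predict."""
    return list(zip(tokens[:-1], tokens[1:]))

# Hypothetical example: 4 tokens with identical raw scores everywhere.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)

print(shifted_targets(["the", "cat", "sat"]))
```

Each row of `weights` is nonzero only up to its own position, so every position can be trained in the same pass without any of them seeing the token it is supposed to predict.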