by SoothingSorbet
541 days ago
I still find this explanation confusing, because decoder-only transformers also embed their input and you can extract input embeddings from them. Is there a difference here other than encoder-only transformers being bidirectional and having input embeddings as their primary output rather than a byproduct? Is there a reason, other than bidirectionality, that we use dedicated encoder-only embedding models instead of just cutting and pasting a decoder-only model's embedding phase?

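Fundamentally, the difference is bidirectional attention in the encoder versus a triangular (or "causal") attention mask in the decoder. Under the causal mask each token can only attend to the tokens before it, so its representation carries no information about anything that comes later in the input.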
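
To make the mask difference concrete, here is a toy sketch in plain Python. It is not taken from any real model: the helper names `attention_mask` and `attend` are made up for illustration, the "embeddings" are single scalars, and the attention scores are all set to zero so that every visible position gets equal weight.

```python
import math

def attention_mask(n, causal):
    """Return an n x n mask; mask[i][j] is True if position i may attend to j.

    Hypothetical helper: causal=True gives the lower-triangular decoder mask,
    causal=False gives the all-True bidirectional encoder mask.
    """
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def attend(scores, values, mask):
    """Toy single-head masked softmax attention over scalar values."""
    out = []
    for i, row in enumerate(scores):
        # Masked positions get zero weight (same effect as a -inf score).
        weights = [math.exp(s) if mask[i][j] else 0.0 for j, s in enumerate(row)]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, values)) / total)
    return out

n = 3
values = [1.0, 2.0, 3.0]
# Assumption for illustration: uniform scores, so attention is a plain average
# over whichever positions the mask leaves visible.
scores = [[0.0] * n for _ in range(n)]

bidirectional = attend(scores, values, attention_mask(n, causal=False))
causal = attend(scores, values, attention_mask(n, causal=True))

print(bidirectional)  # [2.0, 2.0, 2.0]
print(causal)         # [1.0, 1.5, 2.0]
```

With the bidirectional mask every position averages over the whole input, so all three outputs reflect the full sequence. With the causal mask the first position only sees itself and only the last position sees everything, which is roughly why per-token representations from a decoder are lopsided as embeddings.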