| HN Mirror



	by EGreg 544 days ago
	Can you go into detail for those of us who aren't as well versed in the tech? What do the encoders do vs the decoders, in this ecosystem? What are some good links to learn about these concepts on a high level? I find most of the writing about different layers and architectures a bit arcane and inscrutable, especially when it comes to Attention and Self-Attention with multiple heads.

2 comments

cubie 544 days ago

On a very high level, for NLP:

1. an encoder takes an input (e.g. text), and turns it into a numerical representation (e.g. an embedding).

2. a decoder takes an input (e.g. text), and then extends the text.

(There are also encoder-decoders, but I won't go into those.)

These two simple definitions immediately give information on how they can be used. Decoders are at the heart of text generation models, whereas encoders return embeddings with which you can do further computations. For example, if your encoder model is finetuned for it, the embeddings can be fed through another linear layer to give you classes (e.g. token classification like NER, or sequence classification for full texts). Or the embeddings can be compared with cosine similarity to determine the similarity of questions and answers. This is at the core of information retrieval/search (see https://sbert.net/). Such similarity between embeddings can also be used for clustering, etc.
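As a rough illustration of those two uses, here is a minimal sketch in plain Python. The embedding vectors, weights, and function names are made up for the example; a real encoder would produce vectors with hundreds of dimensions, and the linear layer's weights would come from finetuning.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar, plus the scores."""
    scores = [cosine_similarity(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores

def linear_head(embedding, weights, bias):
    """One linear layer on top of an embedding; returns the index of the top class."""
    logits = [sum(w * x for w, x in zip(row, embedding)) + b
              for row, b in zip(weights, bias)]
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical 3-dimensional embeddings standing in for encoder outputs.
question = [1.0, 0.0, 1.0]
answers = [
    [0.9, 0.1, 1.1],   # points in nearly the same direction as the question
    [-1.0, 0.5, 0.0],  # points away from the question
    [0.0, 1.0, 0.0],   # orthogonal to the question
]
order, scores = rank_by_similarity(question, answers)

# Hypothetical 2-class head: one weight row and one bias per class.
predicted = linear_head(answers[0], weights=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], bias=[0.0, 0.0])
```

Here `order` lists the answers from best to worst match, which is the basic retrieval step; `predicted` is the class index chosen by the linear layer.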

In my humble opinion (but it's perhaps a dated opinion), (encoder-)decoders are for when your output is text (chatbots, summarization, translation), and encoders are for when your output is literally anything else. Embeddings are your toolbox: you can shape them into anything, and encoders are the wonderful providers of these embeddings.

SoothingSorbet 544 days ago

I still find this explanation confusing because decoder-only transformers still embed the input and you can extract input embeddings from them.

Is there a difference here other than encoder-only transformers being bidirectional and having input embeddings as their primary output (rather than a byproduct)? Is there a reason other than that bidirectionality that we use specific encoder-only embedding models instead of just cutting and pasting a decoder-only model's embedding phase?

craigacp 544 days ago

The encoder's embedding is contextual: it depends on all the tokens. If you pull out the embedding layer from a decoder-only model, then that is a fixed embedding where each token's representation doesn't depend on the other tokens in the sequence. The bidirectionality is also important for getting a proper representation of the sequence, though you can train decoder-only models to emit a single embedding vector once they have processed the whole sequence left to right.
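To make the fixed-versus-contextual distinction concrete, here is a toy sketch in plain Python. The vocabulary, the vectors, and the `contextualize` helper are invented for illustration; mixing each token with the sequence mean is only a crude stand-in for what the attention layers of a real model do.

```python
# Hypothetical embedding table: one fixed vector per token, as in an embedding layer.
EMBEDDING_TABLE = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank":  [0.5, 0.5],
}

def embed(tokens):
    """Static lookup: each token's vector ignores the rest of the sequence."""
    return [EMBEDDING_TABLE[t] for t in tokens]

def contextualize(vectors):
    """Crude stand-in for attention: blend each vector with the sequence mean."""
    dims = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return [[0.5 * v[d] + 0.5 * mean[d] for d in range(dims)] for v in vectors]

static_a = embed(["river", "bank"])
static_b = embed(["money", "bank"])
contextual_a = contextualize(static_a)
contextual_b = contextualize(static_b)

# "bank" is the second token in both sequences.
same_static = static_a[1] == static_b[1]              # lookup is context-independent
same_contextual = contextual_a[1] == contextual_b[1]  # mixing makes it context-dependent
```

With the lookup alone, "bank" gets the same vector in both sequences; once the other tokens are mixed in, the two "bank" vectors diverge.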

Fundamentally, it's the difference between bidirectional attention in the encoder and a triangular (or "causal") attention mask in the decoder.
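A minimal sketch of that difference, in plain Python: single-head scaled dot-product attention where the queries, keys, and values are all just the input vectors (real models apply learned projections first, and the inputs here are made-up toy values). The only thing that changes between the two modes is which positions each token is allowed to attend to.

```python
import math

def attention(x, causal):
    """Single-head self-attention over a list of vectors, with Q = K = V = x.

    causal=True  -> position i attends only to positions 0..i (triangular mask).
    causal=False -> position i attends to every position (bidirectional).
    """
    n, dims = len(x), len(x[0])
    out = []
    for i in range(n):
        allowed = range(i + 1) if causal else range(n)
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(dims)
                  for j in allowed]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * x[j][d] for w, j in zip(weights, allowed))
                    for d in range(dims)])
    return out

# Two toy sequences that differ only in their last token.
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]

causal_a, causal_b = attention(seq_a, causal=True), attention(seq_b, causal=True)
bidir_a, bidir_b = attention(seq_a, causal=False), attention(seq_b, causal=False)

# Under the causal mask, earlier positions never see the changed last token.
first_token_unchanged_causal = causal_a[0] == causal_b[0]
# With bidirectional attention, the first token's output depends on the last one.
first_token_unchanged_bidir = bidir_a[0] == bidir_b[0]
```

Under the causal mask only the final position has seen the whole sequence, which is why a decoder-only model used for embeddings typically has to be trained to produce its vector at the end of the input.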

Kinrany 544 days ago

How much does the choice of the encoder depend on the application?

janalsncm 544 days ago

If you’re interested in learning more, the linked article isn’t a bad place to start.
