Maybe I am not understanding your point.

Out of the box, given a sequence of n tokens, BERT returns a tensor of shape (n_tokens, hidden_size) [1], where hidden_size is independent of the vocabulary size. You can then fine-tune a model on top of this representation for various tasks, e.g. sentiment classification. Hence BERT is described as a language representation model.
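
To make the shapes concrete, here is a toy sketch in plain Python. It is not BERT: toy_encoder is a hypothetical stand-in that returns random vectors, the token ids are made up, and HIDDEN_SIZE is shrunk for readability. It only shows that the output has one hidden vector per token and that a small task head (here a linear layer on the first token's vector, as is commonly done with BERT's [CLS] position) turns that representation into class scores.

```python
import random

HIDDEN_SIZE = 8    # toy value (assumption); BERT-base uses 768
NUM_CLASSES = 2    # e.g. negative / positive sentiment

rng = random.Random(0)


def toy_encoder(token_ids):
    """Hypothetical stand-in for an encoder: one hidden vector per token.

    A real encoder computes contextual vectors; these are just random.
    """
    return [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)] for _ in token_ids]


def classify(hidden_states, weights, bias):
    """Task head: a linear layer applied to the first token's vector."""
    first_vector = hidden_states[0]
    return [
        sum(w * x for w, x in zip(row, first_vector)) + b
        for row, b in zip(weights, bias)
    ]


token_ids = [5, 17, 42, 8, 3]  # made-up ids, not from a real tokenizer
hidden = toy_encoder(token_ids)

# Head parameters would be learned during fine-tuning; random here.
weights = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES
scores = classify(hidden, weights, bias)

print((len(hidden), len(hidden[0])))  # (n_tokens, hidden_size)
print(len(scores))                    # one score per class
```

Note that the vocabulary size appears nowhere in the output shape; only the task head decides what the representation gets mapped to.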

Out of the box, given a sequence, GPT-2 returns logits over the vocabulary [2], i.e. a distribution after a softmax, from which you can sample the next token or simply take the most likely one. Hence GPT-2 is described as a language generation model.
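
As a rough illustration of what "a distribution over the vocabulary" buys you, here is a toy sketch in plain Python. It is not GPT-2: toy_lm_logits is a hypothetical stand-in that returns random scores, and VOCAB is a six-word toy vocabulary. The point is only the last step: one score per vocabulary entry, a softmax, then either the most likely token or a random draw.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]  # toy vocabulary (assumption)


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_lm_logits(context):
    """Hypothetical stand-in for a language model: one score per vocabulary entry.

    A real model computes these from the context; these are just random.
    """
    rng = random.Random(len(context))
    return [rng.gauss(0.0, 1.0) for _ in VOCAB]


context = ["the", "cat"]
probs = softmax(toy_lm_logits(context))

# Greedy decoding: pick the single most likely next token.
greedy = VOCAB[probs.index(max(probs))]

# Sampling: draw the next token in proportion to its probability.
sampled = random.Random(0).choices(VOCAB, weights=probs, k=1)[0]

print(len(probs), greedy, sampled)
```

Here the output size is tied to the vocabulary, which is exactly the difference from the (n_tokens, hidden_size) case.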

You could of course play with BERT's mask token and call the model repeatedly to force it to generate something, and you could chop off some layers of GPT-2 to get a representation of your input sequence, but I think that goes a little beyond the original question.

[1] https://github.com/google-research/bert/blob/master/modeling...

[2] https://github.com/openai/gpt-2/blob/master/src/model.py#L17...