"BERT returns" is ambiguous here. During pretraining, the last layer produces logits over the vocabulary, which are scored against a one-hot target for the token, the same as in GPT: https://github.com/google-research/bert/blob/master/run_pret...
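
To make that concrete, here is a rough sketch of what such a pretraining head computes for one masked position. It is plain Python with toy sizes, not the code from the linked script: the names `vocab_logits`, `log_softmax`, and `masked_lm_loss` are made up for illustration, the numbers are arbitrary, and the extra transform layer that BERT applies before the projection is left out. It assumes the output weights are the token embedding table, which is my understanding of how BERT's masked LM head is set up.

```python
import math

def vocab_logits(hidden, embedding_table, bias):
    """One logit per vocabulary entry: logits[v] = hidden . E[v] + bias[v]."""
    return [sum(h * e for h, e in zip(hidden, row)) + b
            for row, b in zip(embedding_table, bias)]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def masked_lm_loss(hidden, embedding_table, bias, label_id):
    """Cross-entropy against a one-hot target, which reduces to -log p(label_id)."""
    log_probs = log_softmax(vocab_logits(hidden, embedding_table, bias))
    return -log_probs[label_id]

# Toy setup (hypothetical values): vocabulary of 4 tokens, hidden size 3.
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.0, -0.1, 0.4],
    [0.5, 0.5, -0.2],
    [-0.3, 0.1, 0.0],
]
bias = [0.0, 0.0, 0.0, 0.0]
hidden = [1.0, 0.5, -1.0]  # final-layer vector at the masked position

logits = vocab_logits(hidden, embedding_table, bias)
loss = masked_lm_loss(hidden, embedding_table, bias, label_id=2)
print(len(logits), loss)
```

The point is only the shape of the output: one logit per vocabulary entry, compared against a single correct token id. A GPT-style head does the same thing at every position for the next token, rather than only at masked positions.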