That's how they are trained initially, but the resulting model isn't all that useful on its own (it was SOTA two years ago, but this field moves fast).
A lot of the utility comes from the later finetuning. You can see this in the examples from the article: every question they show GPT-3 (the version without that finetuning) getting wrong is answered correctly by ChatGPT, which has gone through an extensive finetuning process called RLHF (reinforcement learning from human feedback).
That's how the text decoder works, but the model gets to define "most likely", and an RLHF model uses this to make the text decoder produce useful answers instead.
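To make that concrete, here is a toy sketch in Python. Everything in it is made up for illustration: `base_model` and `tuned_model` are hypothetical stand-ins with hand-written probabilities, not the real behavior of GPT-3 or ChatGPT. The point is only that the decoder is identical in both cases and the output changes because the model assigns different probabilities.

```python
def greedy_decode(next_token_probs, prompt, max_tokens=5):
    """Repeatedly append whichever token the model rates most likely."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = next_token_probs(tuple(tokens))
        token = max(probs, key=probs.get)  # the decoder only ever asks "what is most likely?"
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens[len(prompt):]


# Hypothetical "pretrained only" model: continues the text the way a web page
# full of quiz questions might, by starting another question.
def base_model(context):
    if context[-1] == "France?":
        return {"What": 0.6, "Paris": 0.3, "<eos>": 0.1}
    return {"<eos>": 1.0}


# Hypothetical finetuned model: same interface, but the probability mass has
# been shifted toward the continuation a person would call a useful answer.
def tuned_model(context):
    if context[-1] == "France?":
        return {"What": 0.1, "Paris": 0.8, "<eos>": 0.1}
    return {"<eos>": 1.0}


prompt = ("What", "is", "the", "capital", "of", "France?")
base_output = greedy_decode(base_model, prompt)
tuned_output = greedy_decode(tuned_model, prompt)
print(base_output, tuned_output)
```

Swapping in sampling instead of greedy selection wouldn't change the argument: the decoder still only consumes whatever distribution the model hands it.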