Attention is all you need paper just proposed an AR model that didn’t have to be trained step by step. The scaling happened later in BERT and GPT and OpenAI’s scaling work