Hacker News
by jimsimmons 1140 days ago
GP is wrong.

The “Attention Is All You Need” paper just proposed an autoregressive (AR) model that didn’t have to be trained step by step. The scaling happened later, with BERT, GPT, and OpenAI’s scaling work.