by abiro
2127 days ago
OpenAI would naturally optimize for the tests Marcus published as a critique of GPT-2, yet GPT-3 still fails spectacularly at physical reasoning (the one test that needs causal reasoning the most). There are two broader points here:
1. The lack of independently verifiable evaluation metrics for these types of models should make everyone very skeptical. (Who can afford to retrain GPT-3 from scratch?)
2. I find it difficult to believe that smart people still insist that a model incapable of representing causal relationships can produce intelligent answers.
|
It would have been difficult for them to do so, since Marcus's GPT-2 critique came out after they collected the dataset for GPT-3.
Marcus's article: Jan 2020
GPT-3 dataset: "Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019"