by djoldman 1704 days ago

Hi stella. Given this paragraph in the paper:

> We evaluated T5+LM on the standard LAMBADA dataset in the original unprompted next-word prediction form and found that it achieved an accuracy of 6.2%. This is substantially below the accuracy of 72.5% achieved by the comparably-sized GPT-3-13B variant. T0 did not fare much better, achieving only 18.7%. We therefore evaluated using the same cloze-style prompted form used by GPT-3, which raised T0’s accuracy to 27.8%. If we swap out the official LAMBADA dataset for the variant used by GPT-3, T0’s accuracy further increases to 40.5% and T5+LM achieves 10.7%. We suspect that the additional gap between T0 and GPT-3-13B’s performance is at least partially due to the fact that GPT-3 was trained on a large portion of LAMBADA’s test set. Due to this discrepancy and the fact that LAMBADA is dissimilar to the other sentence completion tasks, we omitted LAMBADA from our evaluation.

I had two questions:

1. Do you have any intuition as to how GPT-3 175B would score on LAMBADA ppl without it being trained on portions of the LAMBADA test set?

2. It's encouraging to see such high marks on these language tasks. Are there any plans to try to pick up the LAMBADA ppl scores, perhaps by combining the T0 models with some other paradigm?


craffel 1704 days ago

(different author, not Stella)

To your first question: Unpublished experiments done by the BigScience architecture and scaling WG suggest that training on book corpus yields a boost of 10-15% accuracy on LAMBADA.

To your second question: LAMBADA specifically is an interesting task, but it's a bit unsatisfying to work on since there are so many confounding factors in prior work on the dataset. We are planning quite a few follow-up projects along this general line of work (prompted multi-task training), though.
