by djoldman
1704 days ago
Hi stella. Given this paragraph in the paper:

> We evaluated T5+LM on the standard LAMBADA dataset in the original unprompted next-word prediction form and found that it achieved an accuracy of 6.2%. This is substantially below the accuracy of 72.5% achieved by the comparably-sized GPT-3-13B variant. T0 did not fare much better, achieving only 18.7%. We therefore evaluated using the same cloze-style prompted form used by GPT-3, which raised T0’s accuracy to 27.8%. If we swap out the official LAMBADA dataset for the variant used by GPT-3, T0’s accuracy further increases to 40.5% and T5+LM achieves 10.7%. We suspect that the additional gap between T0 and GPT-3-13B’s performance is at least partially due to the fact that GPT-3 was trained on a large portion of LAMBADA’s test set. Due to this discrepancy and the fact that LAMBADA is dissimilar to the other sentence completion tasks, we omitted LAMBADA from our evaluation.

I had two questions:

1. Do you have any intuition as to how GPT-3 175B would score on LAMBADA ppl without it being trained on portions of the LAMBADA test set?

2. It's encouraging to see such high marks on these language tasks. Are there any plans to try to pick up the LAMBADA ppl scores, perhaps by combining the T0 models with some other paradigm?
To your first question: Unpublished experiments done by the BigScience architecture and scaling WG suggest that training on book corpus yields a boost of 10-15% accuracy on LAMBADA.
To your second question: LAMBADA specifically is an interesting task, but it's a bit unsatisfying to work on since there are so many confounding factors in prior work on the dataset. We are planning quite a few follow-up projects along this general line of work (prompted multi-task training), though.