|
|
|
|
|
by stellaathena
1706 days ago
|
|
[Disclaimer: I am an author of the above paper and played a rather minimal role. I am also a prominent member of EleutherAI.] "Instruction-tuning" is clearly in the air. Simultaneous work at Google (released less than two weeks ago) on a model they call FLAN can be found here: https://ai.googleblog.com/2021/10/introducing-flan-more-gene... EleutherAI attempted to do something similar several months ago, but didn't succeed: https://blog.eleuther.ai/tuning-on-eval-harness/ A careful analysis of the similarities and differences between the three approaches would be likely highly beneficial to the community. |
|
> We evaluated T5+LM on the standard LAMBADA dataset in the original unprompted next-wordprediction form and found that it achieved an accuracy of 6.2%. This is substantially below the accuracy of 72.5% achieved by the comparably-sized GPT-3-13B variant. T0 did not fare much better, achieving only 18.7%. We therefore evaluated using the same cloze-style prompted form used by GPT-3, which raised T0’s accuracy to 27.8%. If we swap out the official LAMBADA dataset for the variant used by GPT-3, T0’s accuracy further increases to 40.5% and T5+LM achieves 10.7%. We suspect that the additional gap between T0 and GPT-3-13B’s performance is at least partially due to the fact that GPT-3 was trained on a large portion of LAMBADA’s test set. Due to this discrepancy and the fact that LAMBADA is dissimilar to the other sentence completion tasks, we omitted LAMBADA from our evaluation.
I had two questions:
1. Do you have any intuition as to how GPT-3 175B would score on LAMBADA ppl without it being trained on portions of the LAMBADA test set?
2. It's encouraging to see such high marks on these language tasks. Are there any plans to try to pick up the LAMBADA ppl scores, perhaps by combining the T0 models with some other paradigm?