Hacker News
by stellaathena 1706 days ago
[Disclaimer: I am an author of the above paper and played a rather minimal role. I am also a prominent member of EleutherAI.]

"Instruction-tuning" is clearly in the air. Simultaneous work at Google (released less than two weeks ago) on a model they call FLAN can be found here: https://ai.googleblog.com/2021/10/introducing-flan-more-gene...

EleutherAI attempted to do something similar several months ago, but didn't succeed: https://blog.eleuther.ai/tuning-on-eval-harness/

A careful analysis of the similarities and differences between the three approaches would likely be highly beneficial to the community.

4 comments

Hi stella. Given this paragraph in the paper:

> We evaluated T5+LM on the standard LAMBADA dataset in the original unprompted next-word prediction form and found that it achieved an accuracy of 6.2%. This is substantially below the accuracy of 72.5% achieved by the comparably-sized GPT-3-13B variant. T0 did not fare much better, achieving only 18.7%. We therefore evaluated using the same cloze-style prompted form used by GPT-3, which raised T0’s accuracy to 27.8%. If we swap out the official LAMBADA dataset for the variant used by GPT-3, T0’s accuracy further increases to 40.5% and T5+LM achieves 10.7%. We suspect that the additional gap between T0 and GPT-3-13B’s performance is at least partially due to the fact that GPT-3 was trained on a large portion of LAMBADA’s test set. Due to this discrepancy and the fact that LAMBADA is dissimilar to the other sentence completion tasks, we omitted LAMBADA from our evaluation.

I had two questions:

1. Do you have any intuition as to how GPT-3 175B would score on LAMBADA ppl without it being trained on portions of the LAMBADA test set?

2. It's encouraging to see such high marks on these language tasks. Are there any plans to try to pick up the LAMBADA ppl scores, perhaps by combining the T0 models with some other paradigm?

(different author, not Stella)

To your first question: Unpublished experiments done by the BigScience architecture and scaling WG suggest that training on book corpus yields a boost of 10-15% accuracy on LAMBADA.

To your second question: LAMBADA specifically is an interesting task, but it's a bit unsatisfying to work on since there are so many confounding factors in prior work on the dataset. We are planning quite a few follow-up projects along this general line of work (prompted multi-task training), though.
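
For readers unfamiliar with the two LAMBADA setups mentioned in the quoted paragraph, here is a minimal sketch of how last-word accuracy might be computed in an unprompted form versus a cloze-style prompted form. Everything here is an assumption for illustration: the `cloze_prompt` template is hypothetical (not the exact format used by GPT-3 or the T0 authors), `predict` stands in for whatever model call you use, and each passage is assumed to end with the target word with no trailing punctuation.

```python
from typing import Callable, Iterable, Tuple


def split_last_word(passage: str) -> Tuple[str, str]:
    """Split a passage into (context, target), where target is the final word."""
    context, _, target = passage.strip().rpartition(" ")
    return context, target


def cloze_prompt(context: str) -> str:
    """Hypothetical cloze-style template; the real GPT-3 format may differ."""
    return f"{context} ____. ->"


def last_word_accuracy(
    passages: Iterable[str],
    predict: Callable[[str], str],
    prompt_fn: Callable[[str], str] = lambda context: context,
) -> float:
    """Fraction of passages whose final word is predicted exactly.

    `predict` is a stand-in for a model call: it takes the prompt text
    and returns a single predicted word.
    """
    passages = list(passages)
    if not passages:
        raise ValueError("need at least one passage")
    correct = 0
    for passage in passages:
        context, target = split_last_word(passage)
        guess = predict(prompt_fn(context)).strip()
        correct += int(guess == target)
    return correct / len(passages)
```

Passing `prompt_fn=cloze_prompt` switches the same loop from the unprompted form to the prompted one, which is the kind of change the quoted paragraph reports as affecting T0's score.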

Just want to say thanks for taking the time to put the model on HuggingFace! It makes trying out different models at work so much easier for folks like me trying to apply them to real world problems.
Just in case this question isn't too far out of your way: what kind of hardware would be required to run this model, or which cloud GPU provider can you recommend for this?

From @craffel: It's possible to run inference on a single Google Cloud TPU v3-8 device or on a server with 4x 32GB V100 GPUs. Hugging Face also has an inference API for any model on the Hub: https://api-inference.huggingface.co/docs/python/html/index....
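
As a rough back-of-the-envelope check on the hardware figures in the reply above, the sketch below estimates how much memory the model weights alone would need. It assumes T0 has roughly 11 billion parameters (a figure from the model's public description, not stated in this thread) and ignores activations, framework overhead, and how the weights are sharded across devices, so treat the result as a lower bound rather than a sizing guide.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Assumption: approximate parameter count for T0 (not given in the thread).
T0_PARAMS = 11e9


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(num_params: float, dtype: str, num_devices: int, gb_per_device: float) -> bool:
    """True if the weights fit in the combined device memory (weights only)."""
    return weight_memory_gb(num_params, dtype) <= num_devices * gb_per_device


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(dtype, weight_memory_gb(T0_PARAMS, dtype), "GB")
    # 4x 32GB V100 server, as mentioned in the reply
    print("fits on 4x32GB:", fits(T0_PARAMS, "float32", 4, 32))
```

Under these assumptions the float32 weights exceed a single 32GB card, which is consistent with the suggestion to use a multi-GPU server.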

Thank you for this! Could you or anyone available please explain how to get it to generate JavaScript like with GPT-3? For example, with GPT-3 you can just ask it to "generate a javascript code that collects all the links on the page," but that does not work with the demo prompt on Hugging Face.

Does it allow training prompts, or is that done through more fine-tuning in this model?

Code generation is not supported due to the tokenization strategy.