by srush
1700 days ago
Yes, there are many reproducible measures for benchmarking NLP datasets, and we use many of them in the paper. The issue here is that we were not completely sure of the process OpenAI used in their paper. They report the prompt but not the process of finding it. As their model and process are proprietary, it is hard for us to do an apples-to-apples comparison. This small experiment, though, indicates that it is likely not very robust to prompt wording.