


	by srush 1747 days ago
	Yes, there are many reproducible measures for benchmarking NLP datasets, and we use many of them in the paper. The issue here is that we were not completely sure of the process OpenAI used in their paper. They report the prompt but not the process of finding it. As their model and process are proprietary, it is hard for us to do an apples-to-apples comparison. This small experiment, though, indicates that it is likely not very robust to prompt wording.