by vladf
1116 days ago
> Also be careful that GPT-4/3.5's performance on GSM8K is not true few-shot -- in GPT-4 report they said that they mixed a portion of GSM8K training set to train the model

It'd be really valuable to have "fuzzed" versions of these benchmarks, where you replace the quantities in the questions with randomly sampled values, so that this isn't a concern. Of course, the score would then itself be a random variable, but you could just report an interval.
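To make that concrete, here's a rough sketch of what I mean. Everything in it is made up for illustration: the template, the value ranges, and the `solve` callable (a stand-in for "ask the model and parse its answer") are assumptions, not anything from GSM8K or the GPT-4 report. I picked a Wilson score interval for the accuracy, but a bootstrap would do just as well.

```python
import math
import random

# Hypothetical GSM8K-style template; the wording, placeholders, and
# answer formula are invented for this sketch.
TEMPLATE = (
    "Ava has {a} apples. She buys {b} bags with {c} apples each. "
    "How many apples does she have now?"
)


def sample_problem(rng):
    """Draw random quantities; return (question text, ground-truth answer)."""
    a, b, c = rng.randint(2, 99), rng.randint(2, 20), rng.randint(2, 12)
    return TEMPLATE.format(a=a, b=b, c=c), a + b * c


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def fuzzed_score(solve, n=200, seed=0):
    """Score `solve` on n freshly sampled variants.

    `solve` is a hypothetical callable: question string -> integer answer.
    Returns (accuracy, (lower, upper)).
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        question, answer = sample_problem(rng)
        if solve(question) == answer:
            correct += 1
    return correct / n, wilson_interval(correct, n)
```

You'd call it as `acc, (lo, hi) = fuzzed_score(my_model_fn)` and report the interval instead of a single number. A real version would need many templates, not one, and some care that the sampled values keep each problem well-posed.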