Hacker News
by vladf 1116 days ago
> Also be careful that GPT-4/3.5's performance on GSM8K is not true few-shot -- in the GPT-4 report they said that they mixed a portion of the GSM8K training set into the model's training data

It'd be really valuable to have "fuzzed" versions of these benchmarks, where you replace the quantities in the questions with randomly sampled values, so that training-set contamination isn't a concern. Of course, the score would then itself be a random variable, but you could just report an interval.
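To make the idea concrete, here's a rough sketch of what a fuzzed benchmark could look like. Everything in it is made up for illustration: the two templates, the 3..99 value range, the `toy_adder` stand-in for a model, and the choice of an empirical percentile interval over repeated resamplings. It is not how GSM8K or any existing benchmark is actually built.

```python
import random
import re
import statistics

# Hypothetical templated problems: each answer is recomputed from the
# sampled values, so a memorized answer string is of no use.
TEMPLATES = [
    {
        "text": "Ann has {a} apples and buys {b} more. How many apples does she have?",
        "answer": lambda a, b: a + b,
    },
    {
        "text": "A box holds {a} pens. How many pens are in {b} boxes?",
        "answer": lambda a, b: a * b,
    },
]


def sample_instance(template, rng):
    """Fill one template with freshly sampled quantities (range is an assumption)."""
    a, b = rng.randint(3, 99), rng.randint(3, 99)
    return template["text"].format(a=a, b=b), template["answer"](a, b)


def fuzzed_score(model, templates, rng):
    """Accuracy of `model` on one random instantiation of every template."""
    correct = 0
    for template in templates:
        question, answer = sample_instance(template, rng)
        correct += int(model(question) == answer)
    return correct / len(templates)


def score_interval(model, templates, n_runs=200, seed=0, alpha=0.05):
    """Mean score plus an empirical (1 - alpha) percentile interval over resamplings."""
    rng = random.Random(seed)
    scores = sorted(fuzzed_score(model, templates, rng) for _ in range(n_runs))
    lo = scores[int((alpha / 2) * (n_runs - 1))]
    hi = scores[int((1 - alpha / 2) * (n_runs - 1))]
    return statistics.mean(scores), (lo, hi)


def toy_adder(question):
    """Stand-in "model" that adds every number it sees in the question."""
    return sum(int(n) for n in re.findall(r"\d+", question))


if __name__ == "__main__":
    mean, (lo, hi) = score_interval(toy_adder, TEMPLATES)
    print(f"score = {mean:.2f}, 95% interval = [{lo:.2f}, {hi:.2f}]")
```

In practice `model` would wrap a call to whatever system you're evaluating plus some answer extraction, and you'd want many more templates than two before the interval means much.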


Seeing identical problems with different values still doesn't count as zero-shot. It is better though, for sure.