

by rfoo 732 days ago

I agree. However, it is not clear-cut what's fair and what is "gaming the benchmark" in this setup. For example:

- can I train on my own private training set (which is harder)?

- can I pretrain on The Pile or something similar, a dataset full of text crawled from the web?

- can I pretrain on elementary school textbooks?

It seems like the latter two are acceptable, given the use of GPT-4o here. But then, are the latter two that different from the first one? GPT-4o has the public test set in its training data (GPT-4o is definitely trained on public GitHub repos).

What's the point of having a training set with a different distribution in this case, other than making participation harder? Maybe it's to discourage data-hungry approaches, but if there are legit shortcuts, anyone who seriously wants to win would take them.