by rfoo
732 days ago
I agree. However, it is not clear-cut what's fair and what is "gaming the benchmark" in this setup. For example:

- can I train on my own private training set (which is harder)?
- can I pretrain on The Pile or something similar, a dataset full of text crawled from the web?
- can I pretrain on elementary school textbooks?

It seems like the latter two are acceptable, given the use of GPT-4o here. But then, are the latter two really that different from the first one? GPT-4o has the public test set in its training data (GPT-4o is definitely trained on public GitHub repos).

What's the point of having a training set with a different distribution in this case, other than making it harder to participate? Maybe it's to discourage data-hungry approaches, but if there are legit shortcuts, anyone who seriously wants to win will take them.