by obastani
535 days ago
It's not about training directly on the test set, it's about people discussing questions in the test set online (e.g., in forums), and then this data is swept up into the training set. That's what makes test set contamination so difficult to avoid.
That is the "reality": because companies can train their models on the whole Internet, they will train their (base) models on the entire Internet.
And in this situation, "having heard the problem" actually serves as a barrier to understanding these harder problems, since any variation of a known problem will receive a standard "half-assed guesstimate".
And these companies "can't not" use these base models, since they're resigned to the "bitter lesson" (better: the "bitter lesson viewpoint", imo) that they need large-scale heuristics at the start of their process, and only then can they start symbolic/reasoning manipulations.
But hold up! Why couldn't an organization freeze their training set and their problems and release both to the public? That would give us an idea of where the research stands. Ah, the answer comes out: 'cause they don't own the training set, and the thing they want to train is a commercial product that needs every drop of data to be the best. As Yann LeCun has said, this isn't research, this is product development.