by vladislav 2338 days ago
They would have been able to win and get away with it if they had incorporated the knowledge of the external dataset directly into the ML model, provided they had a reasonable estimate of the fraction of overlap between the external data and the test set. A weak version of this would be to just train on the external data in addition to the provided data. A stronger version would train normally on the provided training data and, in addition, overfit on a random subset of some percentage of the external data (with some small random prediction error thrown in to obfuscate), which would get results equivalent to what they did with logic.
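To make the leakage effect concrete, here is a minimal sketch on synthetic data, not anything from the actual competition. It only covers the "weak version" (train on external rows alongside the provided ones) and measures how much the test score inflates. The 1-nearest-neighbour model, the data generator, the 60% overlap, and every name in it (`one_nn_predict`, `leakage_demo`) are assumptions made up for illustration.

```python
import numpy as np


def one_nn_predict(train_x, train_y, query_x):
    """Predict each query label from its single nearest training row."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[d2.argmin(axis=1)]


def leakage_demo(n_train=500, n_test=500, overlap=0.6, seed=0):
    """Return (clean_accuracy, leaky_accuracy) on a synthetic test set.

    `overlap` is the assumed fraction of test rows that also appear,
    with labels, in the hypothetical external dataset.
    """
    rng = np.random.default_rng(seed)

    def sample(n):
        # Noisy labels: only the first feature carries signal.
        x = rng.normal(size=(n, 5))
        y = (x[:, 0] + rng.normal(size=n) > 0).astype(int)
        return x, y

    train_x, train_y = sample(n_train)
    test_x, test_y = sample(n_test)

    # Hypothetical external data: an exact copy of a random subset of test rows.
    n_leaked = int(round(overlap * n_test))
    leaked = rng.choice(n_test, size=n_leaked, replace=False)
    ext_x, ext_y = test_x[leaked], test_y[leaked]

    # Model fit on the provided data only.
    clean_pred = one_nn_predict(train_x, train_y, test_x)

    # Same model fit on provided + external data.
    leaky_pred = one_nn_predict(
        np.vstack([train_x, ext_x]),
        np.concatenate([train_y, ext_y]),
        test_x,
    )

    clean_acc = float((clean_pred == test_y).mean())
    leaky_acc = float((leaky_pred == test_y).mean())
    return clean_acc, leaky_acc


clean_acc, leaky_acc = leakage_demo()
print(f"provided data only:  {clean_acc:.3f}")
print(f"provided + external: {leaky_acc:.3f}")
```

Because a memorizing model gets every overlapped row right, the leaky score can never drop below the overlap fraction, which is why an estimate of that fraction matters in the argument above.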
4 comments

"A weak version of this would be to just train on the external data in addition to the provided data."

In this competition, the training code was run on Kaggle's system, so you'd still need to smuggle in the extra data.

Part of the reason h2o.ai fired him? He cheated, OK, but he also cheated so stupidly.
This.

You've got the test set. Create random hyperparameters and tune them to fit. The way they cheated is stupid.

And the way the testing set can be obtained is silly.

This is a really good point!

Considering the guy was smart (he is a Kaggle Grandmaster), I would really like to know what prevented him from training on the scraped data, and what motivated him to obfuscate the known-sample lookup.

Maybe there's some technicality that made it impossible to tune the model on the additional scraped training data.