| HN Mirror



	by vladislav 2338 days ago
	They would have been able to win and get away with it if they had incorporated the knowledge of the external dataset directly into the ML model, provided they had a reasonable estimate of the fraction of overlap between the external data and the test set. A weak version of this would be to just train on the external data in addition to the provided data. A stronger version would train normally on the provided training data and, in addition, overfit on a random subset of some percentage of the external data (with some small random prediction error thrown in to obfuscate), which would get results equivalent to what they did with logic.
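	To make the leakage effect concrete, here is a toy sketch of the "weak version" only. It is not the competitors' code or the competition data: the dataset is synthetic, the labels are coin flips (so an honest model cannot beat chance), and the helper names (`make_rows`, `nearest_label`, `leakage_demo`) and the 50% overlap are assumptions made up for illustration. A memorizing model (1-nearest-neighbour) trained on provided plus overlapping external rows scores far above chance on the test set purely because of the overlap.

```python
import random

def make_rows(rng, n):
    # Synthetic rows: a random feature and a coin-flip label.
    # By construction there is no real signal to learn.
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]

def nearest_label(train, x):
    # 1-nearest-neighbour: return the label of the closest training row.
    return min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(train, test):
    hits = sum(nearest_label(train, x) == y for x, y in test)
    return hits / len(test)

def leakage_demo(seed=0, n_train=500, n_test=500, overlap=0.5):
    rng = random.Random(seed)
    provided = make_rows(rng, n_train)
    test = make_rows(rng, n_test)
    # Hypothetical "external" dataset: a random subset of the test rows,
    # labels included. The overlap fraction is an assumption.
    external = rng.sample(test, int(overlap * n_test))
    clean = accuracy(provided, test)              # trained on provided data only
    leaked = accuracy(provided + external, test)  # provided + external data
    return clean, leaked

if __name__ == "__main__":
    clean, leaked = leakage_demo()
    print(f"provided only: {clean:.2f}  provided + external: {leaked:.2f}")
```

	Every test row that also appears in the external set is matched to itself at distance zero, so the second score is at least the overlap fraction, while the first stays near chance. That gap is also why a score like this looks suspicious on a task with little real signal.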

4 comments

rahimnathwani 2338 days ago

"A weak version of this would be to just train on the external data in addition to the provided data."

In this competition, the training code was run on Kaggle's system, so you'd still need to smuggle in the extra data.


m3kw9 2338 days ago

Part of the reason h2o.ai fired him? He was a cheat, OK, but he also cheated so stupidly.


alanfranz 2338 days ago

This.

You've got the testing set. Create random HPs and tune them to fit. The way they cheated is stupid.

And the way the testing set can be obtained is silly.


oakhaven 2338 days ago

This is a really good point!

Considering the guy was smart (he is a Kaggle grandmaster), I would really like to know what prevented him from training on the scraped data, and what motivated him to obfuscate the known sample lookup.

Maybe there was some technicality that made it impossible to tune the model on the additional scraped training data.
