by g82918 2350 days ago
I don't see much wrong here, or how this would be cheating. They produced a winning entry; it should be on the organizer to ensure that their test data set isn't trivially findable. It would be like testing a digit recognizer on the MNIST data set and being surprised when someone just hashes it. A real solution isn't to force open-sourcing; it is to get better metrics. Maybe add a random component like a GAN to generate potential test data, and see if anything classifies that correctly. In the real world, when the metric becomes the target, it ceases to be a good metric. So test what you want to test and not just some existing data set.

Edit: I didn't see that the test data was given. See the first reply to this comment.
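For anyone wondering what "just hashes it" would look like in practice, here is a rough sketch. The class name, the toy 4-pixel "images", and the labels are all made up for illustration; this is not anyone's actual entry, only a lookup table keyed on a hash of the raw pixels.

```python
import hashlib


def fingerprint(pixels):
    """Hash the raw pixel bytes so identical images map to the same key."""
    return hashlib.sha256(bytes(pixels)).hexdigest()


class HashLookupClassifier:
    """Hypothetical 'model' that memorizes image -> label pairs.

    It learns nothing about digits; it only recognizes exact copies.
    """

    def __init__(self):
        self.table = {}

    def fit(self, images, labels):
        for image, label in zip(images, labels):
            self.table[fingerprint(image)] = label

    def predict(self, image, default=-1):
        # Any image not seen byte-for-byte falls back to the default.
        return self.table.get(fingerprint(image), default)


# Toy stand-ins for leaked test images (assumed data, values in 0..255).
leaked_images = [[0, 255, 0, 255], [255, 255, 0, 0]]
leaked_labels = [1, 7]

clf = HashLookupClassifier()
clf.fit(leaked_images, leaked_labels)
```

It scores perfectly on the memorized set and falls over as soon as a single pixel changes, which is why freshly generated test data would expose it.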

1 comment

The issue isn't that they found a copy of the test data online. (The test input data was provided to them as part of the problem.)

The issue is that they manually labeled the test data, and then pretended they didn't.

The competition objective is to provide an ML solution that produces labels for the test data, showing your work with code (to prove you didn't just hand-label the data).

Instead, they did manually label the data, and hid their manual labels in the id column of that external data source.
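To show why an id column can carry labels without looking suspicious, here is a toy sketch. The thread does not say how the labels were actually encoded, so the scheme below (label packed into the last digit of the id) and every name in it are purely hypothetical.

```python
def encode_id(row_number, label, num_classes=10):
    """Hypothetical scheme: pack the label into the low digit of the id."""
    return row_number * num_classes + label


def decode_label(opaque_id, num_classes=10):
    """Recover the hidden label from an id built by encode_id."""
    return opaque_id % num_classes


# Made-up hand labels and features for three test rows.
hand_labels = [3, 0, 7]
features = [0.2, 0.9, 0.5]

# The "external data source" looks like ordinary rows with numeric ids.
external_rows = [
    {"id": encode_id(i, label), "feature": feature}
    for i, (label, feature) in enumerate(zip(hand_labels, features))
]

# A pipeline that reads the ids can reproduce the hand labels exactly.
recovered = [decode_label(row["id"]) for row in external_rows]
```

The ids here come out as 3, 10, and 27, which look like arbitrary identifiers unless you know to check the last digit.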

Ah, thank you, that part wasn't clear when I was reading it!