by Homunculiheaded 5052 days ago
If you read through the past winners you'll find that in many cases a very simple model wins. I believe one of the winners who wrote a blog post had pretty much the background you describe.

When I started I was in a similar position to you and just wanted to see if I could even tread water with some of the really knowledgeable members of the community. I ended up placing in the top 5 for one of the contests I was in (with btw a really simple model).

They usually give you some starter code in either R or Python that reproduces the results for a benchmark. Start there, then use cross-validation to see if you can beat that benchmark, and if you do, submit. It's very addictive and you'll come away knowing a lot more than you started with.
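To make the "beat the benchmark under cross-validation, then submit" loop concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the synthetic data, the mean predictor standing in for a contest benchmark, the one-feature line as the "simple model", and RMSE as the metric. A real contest supplies its own starter code and scoring rule.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def rmse(y_true, y_pred):
    """Root mean squared error (lower is better)."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def fit_mean(xs, ys):
    """Hypothetical benchmark: always predict the training mean."""
    m = statistics.fmean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Simple candidate: one-feature least-squares line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def cv_score(fit, xs, ys, k=5, seed=0):
    """Mean held-out RMSE across k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(rmse([ys[i] for i in fold], [model(xs[i]) for i in fold]))
    return statistics.fmean(scores)

# Synthetic data, made up for illustration: y = 3x + noise
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + rng.gauss(0, 1) for x in xs]

benchmark = cv_score(fit_mean, xs, ys)
candidate = cv_score(fit_line, xs, ys)
should_submit = candidate < benchmark  # only submit if CV says you beat the benchmark
print(f"benchmark RMSE={benchmark:.3f}  candidate RMSE={candidate:.3f}  submit={should_submit}")
```

Both models are scored on the same folds (same seed), so the comparison is not skewed by a lucky split.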


Awesome, thanks for the info. I am checking out some of the benchmarks now.

Why do you think it is that simple models often win? Is it due to the experts not participating, or is there a lot more low-hanging fruit than I previously thought? Or is it just that simple models are easier for humans to use and reason about, and thus easier to get right?

If you look at the bios of the top 50 Kagglers, there are some pretty impressive backgrounds there, and they participate heavily. So I don't think the experts staying away is the reason.

I know from my own beginner mistakes that it's a big error to try something out of the box and then immediately chase better CV scores by building much more complicated solutions.

The truth is a lot of work has been put into any standard implementation of an SVM, RandomForest, etc. (and even more work has been put into the theory behind those algorithms). Since I haven't come in 1st in any competition and am not an ML expert, I don't think I can give you the correct strategy to win. But I can say that, as a general trend, all of my attempts to create non-standard complicated models did terribly, while the decisions I made after researching how to fix a particular problem in a known solution tended to perform better (e.g. "How to deal with imbalanced classification problems with a RF?" type questions).
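As one example of what researching the imbalanced-classification question can turn up: a common remedy is to weight each class inversely to its frequency. The sketch below computes such weights with the heuristic n_samples / (n_classes * class_count), which is the formula scikit-learn documents for its `class_weight="balanced"` option. The function name and the 90/10 label split are made up for illustration, and this is only one of several possible answers (resampling is another), not necessarily the one the commenter used.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get larger weights, so errors on them cost more in training.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

# Made-up labels: 90 negatives, 10 positives
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With a standard implementation you would normally pass the option to the library rather than compute this by hand; the point is that a small, well-understood adjustment to a known model is often worth more than a custom one.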

I work at Kaggle.

In many cases where simple models win, there's some insight into the data that the winner found - engineered a new feature, or noticed a pattern and appropriately tuned a particular method. Where those insights exist, they often overshadow any gains by super-sophisticated ML techniques.

Makes sense. Thanks!