by Homunculiheaded 5052 days ago
If you're interested in ML/data mining but haven't had a chance to apply your ideas to any hard or interesting problems, I strongly recommend a Kaggle contest. It's one thing to plug some data into a random forest and go "oh cool, I guess that did okay" and entirely another to see how you compare with other competitors.

One of the biggest challenges I've found in implementing ML projects is that I don't have a great sense of when I've really gotten the most information out of the data. I'm not particularly competitive, but the contest format is great for this. When you see that a solution you'd normally be happy with ranks in the lower half of the submissions, you're really pushed to improve it.

This leads you to learn your tools and algorithms better. For a couple of contests I took seriously, I ended up learning tons about R, spent most of my nights reading academic papers on various newer techniques, and also read through a few books. On top of all that, you really should spend time reading up on how past winners have won, which gives a bunch of practical insight into approaching different ML problems.

In the contest I tried hardest in, I actually placed terribly after the final results were calculated, but looking over what went wrong I was amazed to see how far my understanding of ML had progressed. I'd say a month of seriously competing is easily worth a semester-long grad class.

2 comments

I am really interested in ML, but have only recently been diving into it. I have watched all of the videos for Andrew Ng's Coursera course (and done most of the programming exercises), but just looking over some of the Kaggle contests I think I would quickly be out of my depth.

Would I be wasting my time attempting these with such a basic level of knowledge?

If you read through the past winners you'll find that in many cases a very simple model will win. I believe one of the winners that posted a blog post had pretty much the background you describe.

When I started I was in a similar position to you and just wanted to see if I could even tread water with some of the really knowledgeable members of the community. I ended up placing in the top 5 for one of the contests I was in (with btw a really simple model).

They usually give you some starter code in either R or Python that reproduces the results for a benchmark. Start there, then use cross-validation to see if you can beat that benchmark, and if you do, submit. It's very addictive and you'll come away knowing a lot more than you started with.
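As a rough illustration of that "beat the benchmark with cross-validation" loop, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the data is synthetic, `fit_mean` stands in for a contest's starter benchmark, and `fit_line` stands in for whatever model you are trying. Real starter code and real contest metrics will look different.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(fit, xs, ys, k=5, seed=0):
    """Average held-out mean squared error of `fit` over k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on the training folds, then score on the held-out fold.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errs = [(predict(xs[i]) - ys[i]) ** 2 for i in held_out]
        scores.append(sum(errs) / len(errs))
    return sum(scores) / len(scores)

def fit_mean(xs, ys):
    """Hypothetical benchmark: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """Hypothetical candidate: one-variable least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

# Synthetic data, made up for illustration: y = 3x + 2 plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

benchmark_mse = cross_val_mse(fit_mean, xs, ys)
candidate_mse = cross_val_mse(fit_line, xs, ys)
should_submit = candidate_mse < benchmark_mse  # lower error is better here
print(f"benchmark={benchmark_mse:.3f} candidate={candidate_mse:.3f} submit={should_submit}")
```

Both models are scored on the same folds (same `seed`), so the comparison is not skewed by one of them getting an easier split.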

Awesome, thanks for the info. I am checking out some of the benchmarks now.

Why do you think it is that simple models often win? Is it due to the experts not participating, or is there a lot more low-hanging fruit than I previously thought? Or is it just that simple models are easier for humans to use and reason about, and thus easier to get right?

If you look at the bios of the top 50 Kagglers, there are some pretty impressive backgrounds there, and they participate heavily. So I don't think that's the reason.

I know from my own beginner mistakes that it's a big error to try something out of the box and then immediately chase better CV scores by creating much more complicated solutions.

The truth is a lot of work has been put into any standard implementation of an SVM, RandomForest etc (and even more work has been put into the theory behind those algorithms). Since I haven't come in 1st in any competition and am not an ML expert, I don't think I can give you the correct strategy to win. But I can say that, as a general trend, all of my attempts to create non-standard complicated models did terribly, while many of the decisions I made based on research into fixing a particular problem in a known solution seemed to perform better (i.e. "How to deal with imbalanced classification problems with a RF?" type questions).
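To make the imbalanced-classification example concrete: one common fix is to weight each class inversely to its frequency and pass those weights to a standard model, rather than inventing a new one. The sketch below only computes the weights, in plain Python. The function name and the 90/10 label split are made up for illustration, and nothing in this thread says this is the specific fix that was used.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# One weight per training row, in the order of `labels`.
sample_weights = [weights[y] for y in labels]

# With these weights, each class contributes the same total weight.
total_per_class = {
    cls: sum(w for w, y in zip(sample_weights, labels) if y == cls)
    for cls in weights
}
print(weights, total_per_class)
```

Many standard random forest implementations accept per-class or per-sample weights, so a list like `sample_weights` can usually be handed to the existing implementation without changing the algorithm itself.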

I work at Kaggle.

In many cases where simple models win, there's some insight into the data that the winner found - engineered a new feature, or noticed a pattern and appropriately tuned a particular method. Where those insights exist, they often overshadow any gains by super-sophisticated ML techniques.

Makes sense. Thanks!

You might not do particularly well (or you might!), but diving into a Kaggle competition is one of the better ways to learn a lot about machine learning very quickly.

This is a great idea! I just signed up.

So let me ask you, where do you read up on how past winners have won?

How did you decide on algorithms to try out on a contest? How did you find promising academic papers?

Kaggle has blog posts of many of the past winners: http://blog.kaggle.com/category/dojo/

For algorithms, just try whatever you know best or whatever is fastest to implement. If you're using R I highly recommend the caret package.

For papers: the best place to get started is browsing the forums or any similar contests. The community there is actually pretty awesome and will frequently post papers. After that, searching Google Scholar (or even just Google) for particular problems will yield nice results.

Also check out the wiki: http://www.kaggle.com/wiki/Home