Why Machine Learning is Kaggle competitions.
"I used an out-of-the-box algorithm, messed around a bit, and definitely did not make the leaderboard."
Because that is not Kaggle competitions. Nearly everyone on the leaderboard is proficient in data analysis and machine learning. They all tried that out-of-the-box algorithm too, but they did not give up so easily.
"Understand the business problem. If you want to predict flight arrival times, what are you really trying to do?"
This is not different from Kaggle competitions; it is a tip for performing better in them. See also the GE Flight Quest: https://www.gequest.com/c/flight
Those winners used industry-standard machine learning and optimization techniques, but also creative insights and hunches, like tweaking the target labels:
"A next step is to ask, “What should I actually be predicting?”. This is an important step that is often missed by many – they just throw the raw dependent variable into their favorite algorithm and hope for the best. But sometimes you want to create a derived dependent variable. I’ll use the GE Flight Quest as an example: you don't want to predict the actual time the airplane will land; you want to predict the length of the flight; and maybe the best way to do that is to use the ratio of how long the flight actually was to how long it was originally estimated to be and then multiply that times the original estimate." - Steve Donoho - http://blog.kaggle.com/2014/08/01/learning-from-the-best/
Furthermore, it is clear to everyone that data science in a business setting and data science as a competitive sport are different. To say they are equal would be like saying that paintball is equal to being in the military. But to say that Kaggle is not machine learning is to say that paintball requires no marksmanship.
There are some very messy, unwieldy datasets on Kaggle right now. For example, the Seizure Detection challenge has many GBs of raw sensor data from just a few patients. It requires a competitor to clean the data, understand the problem domain, understand the evaluation metric, measure cross-validation performance, and put their model in production on their laptop in the evening hours.
The author of that blog post is invited to team up with me or others. Let's see if we can use machine learning to improve some pressing issues. I'd also love it if Stripe could host a contest on Kaggle.
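To make the derived-target idea from the Donoho quote concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the toy flight numbers are made up, the single `headwind` feature and the `fit_line` / `predict_duration` helpers are hypothetical, and a one-feature least-squares line stands in for whatever model the actual winners used. The only point is the shape of the trick: fit on the ratio of actual to estimated duration, then multiply the predicted ratio by the original estimate.

```python
# Hypothetical toy data: (scheduled_minutes, headwind_knots, actual_minutes).
# All numbers are invented for illustration; they are not from the GE Flight Quest.
flights = [
    (120.0, 10.0, 126.0),
    (300.0, 0.0, 300.0),
    (90.0, 20.0, 99.0),
    (240.0, 5.0, 246.0),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Derived dependent variable: actual duration / originally estimated duration,
# instead of the raw duration or landing time.
headwinds = [headwind for _, headwind, _ in flights]
ratios = [actual / scheduled for scheduled, _, actual in flights]
intercept, slope = fit_line(headwinds, ratios)


def predict_duration(scheduled_minutes, headwind_knots):
    """Predict the ratio, then scale it back by the original estimate."""
    predicted_ratio = intercept + slope * headwind_knots
    return predicted_ratio * scheduled_minutes


if __name__ == "__main__":
    print(round(predict_duration(200.0, 10.0), 2))
```

The same model fitted directly on raw minutes would have to learn the scale of each flight from scratch; the ratio target lets the original estimate carry that information, so the model only has to learn the correction.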
I think the most legitimate knock against Kaggle is that in many business settings there probably isn't much value in improving that extra .00001 (but obviously there are exceptions).
Anyway, I think that even someone very experienced in machine learning would learn something from doing a Kaggle competition, especially if the competition is in an area outside their core expertise.