by blauwbilgorgel 4317 days ago

Why Machine Learning is Kaggle competitions.

  I used an out-of-the-box algorithm, messed around a bit, 
  and definitely did not make the leaderboard.

Because that is not what Kaggle competitions are about. Nearly everyone on the leaderboard is proficient in data analysis and machine learning. They all tried that out-of-the-box algorithm as a first attempt, but they did not give up so easily.

  Understand the business problem
  If you want to predict flight arrival times, what are 
  you really trying to do?

This is no different from Kaggle competitions; it is a tip for performing better in them. See also the GE Flight Quest (https://www.gequest.com/c/flight). Those winners used industry-standard machine learning and optimization techniques, but also creative insights and hunches, like tweaking the target labels:

"A next step is to ask, “What should I actually be predicting?”. This is an important step that is often missed by many – they just throw the raw dependent variable into their favorite algorithm and hope for the best. But sometimes you want to create a derived dependent variable. I’ll use the GE Flight Quest as an example: you don't want to predict the actual time the airplane will land; you want to predict the length of the flight; and maybe the best way to do that is to use the ratio of how long the flight actually was to how long it was originally estimated to be and then multiply that times the original estimate." - Steve Donoho - http://blog.kaggle.com/2014/08/01/learning-from-the-best/

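To make Donoho's point concrete, here is a minimal sketch of the derived-target idea. It is not what the Flight Quest winners actually ran: the flight durations are made up, `fit_mean_ratio` and `predict_duration` are hypothetical helpers, and the "model" is just a mean ratio standing in for whatever learner you would really use.

```python
from statistics import mean

# Hypothetical history: (originally estimated minutes, actual minutes).
history = [(120.0, 126.0), (90.0, 99.0), (200.0, 190.0), (60.0, 63.0)]


def fit_mean_ratio(rows):
    """Stand-in 'model': the average of actual / estimated duration.

    The derived dependent variable is the ratio, not the raw duration.
    """
    return mean(actual / estimated for estimated, actual in rows)


def predict_duration(ratio, estimated_minutes):
    """Map a predicted ratio back to minutes by multiplying by the estimate."""
    return ratio * estimated_minutes


ratio = fit_mean_ratio(history)
predicted = predict_duration(ratio, 150.0)
print(f"ratio={ratio:.4f}, predicted duration={predicted:.1f} min")
```

The ratio target keeps short and long flights on the same scale, so a learner does not have to rediscover the original estimate before it can correct it.
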
Furthermore, it is entirely clear to everyone that data science in a business setting and data science in a competitive sport setting are different. To say they are equal would be like saying that paintball is equal to being in the military. But to say that Kaggle is not machine learning is like saying that paintball requires no marksmanship.

There are some very messy, unwieldy datasets on Kaggle right now. For example, the Seizure Detection challenge has many GBs of raw sensor data from just a few patients. This requires a competitor to clean the data, understand the problem domain, understand the evaluation metric, measure cross-validation performance, and put a model in production on a laptop in the evening hours.

The author of that blog post is invited to team up with me or others. Let's see if we can use machine learning to improve some pressing issues. I'd also love it if Stripe could host a contest on Kaggle.

1 comment

mkrump 4317 days ago

Completely agree. Everything that is supposedly not addressed in Kaggle actually is, aside from productionizing your model and monitoring it in production. Sure, it's a simplified, less open-ended version of what you'll encounter in the real world, but that doesn't mean that many of the core concepts don't translate. It's kind of like saying that you shouldn't do your calculus practice problems because math in the real world is never so straightforward.

I think the most legitimate knock against Kaggle is that in many business settings there probably isn't much value in squeezing out that extra .00001 of score (but obviously there are exceptions).

Anyway, I think that even someone very experienced in machine learning would learn something from doing a Kaggle competition, especially if the competition is in an area outside their core expertise.
