| HN Mirror

	by VHRanger 3160 days ago
	> There is so much focus on the machine learning algorithms rather than on getting data ready for them. Generally, once a problem at work has reached the point of being a "kaggle problem", it's trivially easy. The main problem is unstructured data, with endless ways of specifying and measuring the same attribute, and lots of leeway to build an unmaintainable data pipeline between the data generation process and the model at the end.

2 comments

benhamner 3160 days ago

Not all Kaggle problems are created equal. Some look like a train matrix, a single target, and a test matrix.

Others are far more complex and start with much messier data and/or complex formulations.

Examples:

- www.kaggle.com/c/nips-2017-non-targeted-adversarial-attack/
- www.kaggle.com/c/the-allen-ai-science-challenge

sidlls 3160 days ago

I disagree that a "kaggle problem" is trivially easy, but I strongly agree with the sentiment that dealing with unstructured data is often a much bigger, deeper, and broader problem than the choice of a particular algorithm or ensemble of them.

The ability to efficiently and effectively derive insights from such data is scarce.

VHRanger 3160 days ago

Right, by "kaggle problem" I mean the general case where we roughly know what we're going to want to have on the right-hand side of the model we're going to run (plus or minus some feature engineering, model choice and other hyperparameter specification, etc.).
