Competing in a data science contest without reading the data

	Competing in a data science contest without reading the data (blog.mrtz.org)
	37 points by urish 4084 days ago

4 comments

disgruntledphd2 4084 days ago

This is actually a really, really good article. I like the way the author writes both clearly and entertainingly about a reasonably complex topic.

The disconnect between static and interactive data analysis that is at the heart of the post is probably the most ignored issue in science.

To be honest, it's hard not to ignore it given its implications: we only get one shot at a set of test/validation/experimental data, and if we mess up, we're screwed.

sgt101 4084 days ago

So classifiers with no theory behind them (random ones), aggregated to optimize on a non-representative holdout set, end up forming a theory of that set? I think this is expected. If you create classifiers that express some domain theory on the training set in step 1, and use the information in the holdout differently, you'll do a lot better (I believe; at least I think I saw that result when I did my Ph.D. 17 years ago).

Here is a very bad, very bad, very old, very old AAAI workshop paper that sums up the idea (the journal paper is behind a paywall).

http://aaaipress.org/Papers/Workshops/1999/WS-99-06/WS99-06-...

kastnerkyle 4083 days ago

This paper [1] by Bergstra and Cox is one of my favorites on competing without looking at the data. They were actually able to design the model before the data was even released (!)

[1] http://arxiv.org/pdf/1306.3476v1.pdf

louden 4081 days ago

This article illustrates the problem with over-fitting a model even when some data is withheld for testing. This is a trap that one can fall into when using training and testing sets.
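
For anyone who wants to see the trap in action, here is a minimal sketch of the kind of effect being discussed. It is not the article's code. It assumes coin-flip labels (so no classifier can genuinely beat 50%), purely random "classifiers", and a leaderboard that reports exact holdout accuracy. The function name `holdout_attack` and all parameters are made up for illustration.

```python
import numpy as np

def holdout_attack(n_points=1000, n_guesses=1000, seed=0):
    """Aggregate random guesses that happened to score well on a holdout set.

    Returns (holdout_accuracy, fresh_accuracy) of the majority vote.
    Hypothetical setup: labels are coin flips, so true skill is impossible.
    """
    rng = np.random.default_rng(seed)

    # Hidden labels for the reused holdout set and for never-seen fresh data.
    y_holdout = rng.integers(0, 2, size=n_points)
    y_fresh = rng.integers(0, 2, size=n_points)

    # Each row is one random "classifier": its predictions on both sets.
    guesses_holdout = rng.integers(0, 2, size=(n_guesses, n_points))
    guesses_fresh = rng.integers(0, 2, size=(n_guesses, n_points))

    # Query the holdout score of every guess (the interactive part),
    # then keep only the guesses that beat chance on the holdout.
    scores = (guesses_holdout == y_holdout).mean(axis=1)
    keep = scores > 0.5

    # Majority vote of the kept guesses, applied to both sets.
    vote_holdout = (guesses_holdout[keep].mean(axis=0) > 0.5).astype(int)
    vote_fresh = (guesses_fresh[keep].mean(axis=0) > 0.5).astype(int)

    holdout_acc = float((vote_holdout == y_holdout).mean())
    fresh_acc = float((vote_fresh == y_fresh).mean())
    return holdout_acc, fresh_acc
```

Calling `holdout_attack()` should give a holdout accuracy clearly above 50% while the fresh-data accuracy stays near chance, even though nothing was learned. The only information used was the holdout score of each guess, which is exactly what repeated evaluation leaks.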
