
That's not really true -- there are multiple ways of approaching this.

One camp says "collect any data that might be relevant, and then look at the data to figure out what the hypotheses should be".

The other camp says "formulate a hypothesis, and then find the data you need to test that hypothesis".

The problem with the latter approach in the social sciences -- or any setting with lots of unknown latent variables -- is that if you look at enough candidate data sets, it's often possible to find one for which a given hypothesis holds with p < 0.05. So whenever there are a lot of latent variables, it makes a lot more sense to construct a high-quality data set first, and then start hypothesis testing.
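To make the p < 0.05 point concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: the function names are made up, each "data set" is pure Gaussian noise with no real effect, and the test is a plain two-sided z-test with known unit variance. Roughly 5% of the noise-only data sets should still clear the threshold, so searching across data sets will eventually turn up a "confirmation".

```python
import math
import random
from statistics import NormalDist, fmean


def p_value_mean_is_zero(sample):
    """Two-sided z-test p-value for H0: mean == 0, assuming known unit variance."""
    z = fmean(sample) * math.sqrt(len(sample))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def count_false_positives(n_datasets, n_points, alpha=0.05, seed=0):
    """Count noise-only data sets that 'confirm' an effect at level alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_datasets):
        # Hypothetical data set: no true effect, just N(0, 1) noise.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        if p_value_mean_is_zero(sample) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    n = 2000
    hits = count_false_positives(n_datasets=n, n_points=50)
    print(f"{hits} of {n} noise-only data sets reach p < 0.05")
    # With 20 independent tries, the chance of at least one false hit:
    print(f"chance of at least one hit in 20 tries: {1 - 0.95 ** 20:.2f}")
```

Real data sets are not independent draws like this, so treat the numbers as a ballpark, not a model of any actual field.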

The problem with the former approach is that you really need to know up front that "this set of data is probably really interesting / representative for an entire range of hypotheses about topic X", but that's often not clear from the outset. And for any particular hypothesis, there are often lots of other data sets that could also be relevant.

In any case, whenever there are lots of unknown latent variables, cherry-picking data sets that confirm your hypothesis is a really good way to lead yourself astray.

My solution is to just avoid working in fields with lots of latent variables, but that has limitations as well :-)