by sparsely 633 days ago
> based on some data

The important thing is that it isn't based on the data you are attempting to analyse. It's fine to use subject matter expertise beforehand to decide on what is appropriate to include or not in your analysis.
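To make the point concrete, here is a rough simulation sketch (not from the article; the function names, sample sizes, and seed are my own assumptions). Every predictor is pure noise, so any apparent relationship with the outcome is an artifact. It picks the predictor most correlated with the outcome using the same data, then checks that same predictor on a fresh half of the data.

```python
import numpy as np


def column_correlations(X, y):
    """Pearson correlation between each column of X and the vector y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))


def selection_bias_demo(n=100, p=200, trials=200, seed=0):
    """Compare the selected predictor's correlation in-sample vs. held-out.

    All predictors and the outcome are independent noise, so the true
    correlation is zero for every predictor.
    """
    rng = np.random.default_rng(seed)
    in_sample, held_out = [], []
    for _ in range(trials):
        X = rng.standard_normal((2 * n, p))
        y = rng.standard_normal(2 * n)
        X_a, y_a = X[:n], y[:n]  # half used for selection
        X_b, y_b = X[n:], y[n:]  # half never seen during selection

        corr_a = column_correlations(X_a, y_a)
        best = int(np.argmax(np.abs(corr_a)))  # selection based on the data itself

        in_sample.append(abs(corr_a[best]))
        held_out.append(abs(column_correlations(X_b[:, [best]], y_b)[0]))
    return float(np.mean(in_sample)), float(np.mean(held_out))


if __name__ == "__main__":
    mean_in, mean_out = selection_bias_demo()
    print(f"mean |r| on the data used for selection: {mean_in:.3f}")
    print(f"mean |r| on held-out data:               {mean_out:.3f}")
```

The selected predictor should look clearly related to the outcome on the data used to choose it and close to unrelated on the held-out half. Choosing variables from prior subject matter knowledge avoids this because the choice does not depend on the noise in the sample being analysed.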

2 comments

The article seems to analyze a statistical practice from a theoretical perspective.

Using the same perspective, another way to formulate this discussion is:

1. Look at all the data in the universe.

2. Choose some to examine (using a non-random procedure).

3. From those, employ a variable selection procedure (the article argues against stepwise selection and somewhat for Lasso).

4. Fit a model to the remaining data.

In reality, at least two variable selections occur. In the first (choosing which data to examine from the universe of data), we pick variables using some procedure that is itself ultimately grounded in data.

This is a catch-22: unless you look at all the data that exists, you are choosing a subset based on what you know about all the data that exists.

> It's fine to use subject matter expertise beforehand to decide on what is appropriate to include or not in your analysis.

I would say it is more than fine; it is probably the most important thing.