by sparsely
633 days ago
> based on some data

The important thing is that it isn't based on the data you are attempting to analyse. It's fine to use subject matter expertise beforehand to decide on what is appropriate to include or not in your analysis.
Using the same perspective, another way to formulate this discussion is:
1. Look at all the data in the universe.
2. Choose some to examine (using a non-random procedure).
3. From those, employ a variable selection procedure (the article argues against stepwise selection and somewhat for Lasso).
4. Fit a model to the remaining data.
In reality, there are at least 2 variable selections occurring. In the first variable selection (choosing data to examine from the universe of data), we are choosing those variables based on some procedure that is ultimately grounded in data.
This is a catch-22: unless you look at all the data that exists, you choose some subset based on all the data that exists.
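To make the distinction in this thread concrete, here is a minimal simulation sketch. It is not from the article; the pure-noise setup, the function names (`corr`, `simulate`), and the parameters (`n`, `p`, `k`, `seed`) are all assumptions made for illustration. Every candidate variable is unrelated to the outcome, the `k` variables most correlated with the outcome are picked using the analysed sample, and their apparent strength is then compared against a fresh, independent sample.

```python
import random


def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=50, p=200, k=5, seed=0):
    """Select k of p pure-noise variables on one sample, re-check on another.

    Hypothetical setup: all variables and the outcome are independent
    standard normal draws, so any apparent association is chance.
    """
    rng = random.Random(seed)

    def draw():
        y = [rng.gauss(0, 1) for _ in range(n)]
        X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
        return X, y

    X_a, y_a = draw()  # the sample we intend to analyse
    X_b, y_b = draw()  # an independent sample, never used for selection

    # Selection step: rank variables by |correlation| on the analysed sample.
    ranked = sorted(range(p), key=lambda j: abs(corr(X_a[j], y_a)), reverse=True)
    chosen = ranked[:k]

    # Mean |correlation| of the chosen variables on each sample.
    same = sum(abs(corr(X_a[j], y_a)) for j in chosen) / k
    fresh = sum(abs(corr(X_b[j], y_b)) for j in chosen) / k
    return same, fresh


if __name__ == "__main__":
    same, fresh = simulate()
    print(f"mean |r| on the data used for selection: {same:.3f}")
    print(f"mean |r| on independent data:            {fresh:.3f}")
```

Under these assumptions the selected variables should look noticeably stronger on the sample that chose them than on the independent one, which is the sense in which selection "based on the data you are attempting to analyse" is the problematic kind. It says nothing either way about choices made from prior knowledge, which is the point being argued here.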