by vatican_banker
1881 days ago
> If I had one piece of advice to give on the subject it would be PCA the crap out of everything and understand what the top components are doing
There are at least four issues with this advice (wrt doing data analysis):
1. How do you link your PCA components to the original data? Let's say you are tasked to find the main drivers of sales in a given city. You run PCA on the data and find two main components in the dataset. What do you do next? How do you make this information actionable?
2. How do you treat categorical variables? There are PCA methods for dealing with categorical variables, but by the time you apply these methods, on top of the issues in 1), your data has lost all actionable meaning.
3. PCA is _very_ difficult to explain to business stakeholders. The more difficulty business stakeholders have understanding the analysis, the less they will use it.
4. Data-driven business stakeholders will favour clarity and simplicity over sophistication (somewhat linked to 3).
> you are tasked to find the main drivers of sales in a given city. You run PCA on the data and find two main components in the dataset
Obviously it depends on what comes out. But in all likelihood you will see some significant clusterings / divisions in PC1 and PC2, so you will try to interpret which properties of the points are driving those. You can do it in a data-driven way (which coefficients in the principal component vectors are significant) or you can often do it in an exploratory way ... are they related to geography, are they related to age demographics, are they seasonal ... you color the data points by different possible explanatory variables to see which ones group things together. And you will very likely see things jump out (e.g. you could find that the main reason a particular month was down in sales was a technical problem with the web site, and you'll want to put that aside, because it doesn't have any predictive value).
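To make those two routes a bit more concrete, here is a rough sketch on made-up data. Everything in it is an assumption for illustration: the feature names, the `season` label, and the synthetic numbers are invented, and PCA is done with a plain NumPy SVD rather than any particular library's PCA class. It first reads off which features carry the largest PC1 coefficients, then checks whether PC1 scores separate by a candidate explanatory variable (the numeric version of coloring the scatter plot).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: a hidden seasonal factor drives two of the four features.
season = rng.integers(0, 2, size=n)  # 0 = winter, 1 = summer (made-up label)
features = ["temperature", "foot_traffic", "ad_spend", "web_errors"]
X = np.column_stack([
    10 * season + rng.normal(0, 1, n),  # temperature, tied to season
    5 * season + rng.normal(0, 1, n),   # foot_traffic, tied to season
    rng.normal(0, 1, n),                # ad_spend, unrelated to season
    rng.normal(0, 1, n),                # web_errors, unrelated to season
])

# Standardize each column, then do PCA via SVD of the standardized matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)  # share of variance per component
scores = Z @ Vt.T                # each row projected onto the components

# Data-driven route: which original features have the largest PC1 coefficients?
pc1 = Vt[0]
order = np.argsort(-np.abs(pc1))
top_features = [features[i] for i in order[:2]]

# Exploratory route: does a candidate explanatory variable split the PC1 scores?
# The sign of a component is arbitrary, so only the size of the gap matters.
gap = abs(scores[season == 1, 0].mean() - scores[season == 0, 0].mean())

print("variance explained:", np.round(explained, 2))
print("largest PC1 coefficients:", top_features)
print("PC1 gap between seasons:", round(float(gap), 2))
```

In this toy setup PC1 is dominated by the two season-driven features and the seasons sit far apart along it, which is the kind of thing you would then go and verify against the original columns. Real data will rarely be this clean.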