by vatican_banker 1881 days ago
>If I had one piece of advice to give on the subject it would be PCA the crap out of everything and understand what the top components are doing

There are at least four issues with this advice (wrt doing data analysis):

1. How do you link your PCA components to the original data? Let's say you are tasked with finding the main drivers of sales in a given city. You run PCA on the data and find two main components in the dataset. What do you do next? How do you make this information actionable?

2. How do you treat categorical variables? There are PCA methods for dealing with categorical variables, but by the time you apply these methods and add the issues in 1), your data has lost all actionable meaning.

3. PCA is _very_ difficult to explain to business stakeholders. The harder it is for business stakeholders to understand the analysis, the less they will use it.

4. Data-driven business stakeholders will favour clarity and simplicity over sophistication (somewhat linked to 3).

2 comments

The goal isn't to present PCA directly to stakeholders. It is something you do at the start to understand your data. The premise is that it is extremely likely that there are significant batch effects or other statistical correlations in the data that you are unaware of at the outset. You should aim to discover these early on. To do this you need to use an unsupervised method, because the whole point is that you don't know what they are.

> you are tasked to find the main drivers of sales on a given city. You run PCA on the data and find two main components on the dataset

Obviously it depends what comes out. But in all likelihood you will see some significant clusterings / divisions in PC1 and PC2, so you will try to interpret what properties of the points are driving those. You can do it in a data-driven way (what are the significant coefficients in the principal component vectors) or you can often do it in an exploratory way ... are they related to geography, are they related to age demographics, are they seasonal ... you color the data points by different possible explanatory variables to see what groups things together. And you will very likely see things jump out (e.g., you could find that the main reason a particular month was down in sales was a technical problem with the web site, and you'll want to put that aside because it doesn't have any predictive value).
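
To make the two approaches above concrete, here is a minimal sketch using only NumPy. Everything in it is made up for illustration: the data is synthetic, and the feature names (`ad_spend`, `foot_traffic`, etc.), the `region` and `season` labels, and the `group_separation` helper are hypothetical. It runs PCA via SVD, reads off the largest PC1 coefficients (the data-driven route), and then scores how well each candidate explanatory variable separates the points along PC1 (a numeric stand-in for coloring a scatter plot).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a sales table: 200 rows, 4 numeric features.
n = 200
region = rng.integers(0, 2, size=n)   # hypothetical explanatory variable (0/1)
season = rng.integers(0, 4, size=n)   # hypothetical explanatory variable (0..3)
feature_names = ["ad_spend", "foot_traffic", "avg_basket", "returns"]

X = rng.normal(size=(n, 4))
X[:, 0] += 3.0 * region               # region shifts the first two features,
X[:, 1] += 3.0 * region               # playing the role of a hidden batch effect

# Standardize each column, then compute PCA via SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)       # share of variance per component
scores = Z @ Vt.T                     # each row projected onto the components

# Data-driven route: which features carry the largest weights in PC1?
pc1_loadings = dict(zip(feature_names, Vt[0]))
top_features = sorted(feature_names,
                      key=lambda f: abs(pc1_loadings[f]),
                      reverse=True)[:2]

# Exploratory route: how well does a candidate variable separate points on PC1?
def group_separation(pc, labels):
    """Fraction of the variance in `pc` explained by group means (0 to 1)."""
    between = sum(
        np.mean(labels == g) * (pc[labels == g].mean() - pc.mean()) ** 2
        for g in np.unique(labels)
    )
    return between / np.var(pc)

sep = {
    "region": group_separation(scores[:, 0], region),
    "season": group_separation(scores[:, 0], season),
}
print("top PC1 features:", top_features)
print("separation along PC1:", sep)
```

In this toy setup the region label should dominate PC1 while season barely registers, which is the kind of thing that would "jump out" when coloring the points. On real data you would plot PC1 against PC2 and color by each candidate variable rather than rely on a single number.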

Shameless plug: I previously wrote up a blog post on how to use an unsupervised feature selection analog of PCA to avoid many of the issues you point out here, and an associated Python package to carry it out ("linselect", which you can pip install):

https://www.efavdb.com/unsupervised-feature-selection-in-pyt...
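
For readers who want a feel for what "unsupervised feature selection" can mean in practice, here is a rough NumPy sketch of one generic approach: greedy forward selection, where at each step you add the original column that lets you linearly reconstruct the whole table best. This is not the linselect API, and it is an assumption that the package works along similar lines; the `forward_select` function and the toy data are hypothetical. The appeal relative to PCA is that the output is a list of original columns rather than linear combinations of them.

```python
import numpy as np

def forward_select(X, k):
    """Greedily pick k columns of X that best linearly reconstruct all columns.

    Returns the selected column indices (in order of selection) and the
    fraction of total variance explained by the final selection.
    """
    Z = X - X.mean(axis=0)            # center; no rescaling in this sketch
    total = np.sum(Z**2)
    selected = []
    best_r2 = 0.0
    for _ in range(k):
        best_j, best_r2 = None, -np.inf
        for j in range(Z.shape[1]):
            if j in selected:
                continue
            A = Z[:, selected + [j]]  # candidate set of predictor columns
            coef, *_ = np.linalg.lstsq(A, Z, rcond=None)
            resid = Z - A @ coef
            r2 = 1.0 - np.sum(resid**2) / total
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        selected.append(best_j)
    return selected, best_r2

# Toy data: two independent drivers (a, b) plus columns derived from them.
rng = np.random.default_rng(1)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
X = np.column_stack([
    a,                                      # column 0
    b,                                      # column 1
    2 * a + 0.05 * rng.normal(size=n),      # column 2: nearly a copy of a
    a + b + 0.05 * rng.normal(size=n),      # column 3: mix of a and b
    0.1 * rng.normal(size=n),               # column 4: small pure noise
])

selected, r2 = forward_select(X, k=2)
print("selected columns:", selected, "variance explained:", round(r2, 3))
```

Two columns are enough to explain nearly all of the variance here because the table only has two real drivers. The columns are left unscaled in this sketch, so high-variance columns carry more weight; standardizing first is a reasonable alternative.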