by rcar 1867 days ago
PCA is a cool technique mathematically, but in my many years of building models, I've never seen it result in a more accurate model. I could see it potentially being useful in situations where you're forced to use a linear/logistic model since you're going to have to do a lot of feature preprocessing, but tree ensembles, NNs, etc. are all able to tease out pretty complicated relationships among features on their own. Considering that PCA also complicates things from a model interpretability point of view, it feels to me like a method whose time has largely passed.
6 comments

> Considering that PCA also complicates things from a model interpretability point of view

This is a strange comment, since my primary use of PCA/SVD is as a first step in understanding the latent factors that drive the data. Latent factors typically involve all of the important things that anyone running a business or deciding policy cares about: customer engagement, patient well-being, employee happiness, etc. all represent latent factors.

If you have ever wanted to perform data analysis and gain some exciting insight into explaining user behavior, PCA/SVD will get you there pretty quickly. It is one of the most powerful tools in my arsenal when I'm working on a project that requires interpretability.

The "loadings" in PCA and the V matrix in SVD both contain information about how the original feature space correlates with the new projection. This can easily show things like "Users who do X, Y and NOT Z are more likely to purchase".
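To make the point about loadings concrete, here is a minimal NumPy sketch. The data is synthetic and the feature names (`does_x`, `does_y`, `does_z`, `noise`) are made up for illustration: a single hidden "engagement" factor pushes X and Y up and Z down, and the first row of `Vt` should recover that pattern up to an overall sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): one hidden "engagement"
# factor drives features X and Y up and feature Z down.
n = 500
engagement = rng.normal(size=n)
X = np.column_stack([
    engagement + 0.3 * rng.normal(size=n),   # does_x
    engagement + 0.3 * rng.normal(size=n),   # does_y
    -engagement + 0.3 * rng.normal(size=n),  # does_z
    rng.normal(size=n),                      # unrelated noise feature
])

# PCA via SVD of the centred data: rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)  # share of total variance per component
pc1 = Vt[0]                      # weight of each original feature on PC1

feature_names = ["does_x", "does_y", "does_z", "noise"]
for name, weight in zip(feature_names, pc1):
    print(f"{name:>7}: {weight:+.2f}")
print("share of variance on PC1:", round(float(explained[0]), 2))
```

The sign of a principal direction is arbitrary, so only the relative signs matter: X and Y share a sign, Z has the opposite one, and the noise feature gets a weight near zero.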

Likewise, in LSA (Latent Semantic Analysis/Indexing) on a term-frequency matrix you will get a first pass at a semantic embedding. You'll notice, for example, that "dog" and "cat" load on a common PC, which can then be interpreted as "pets".
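A toy version of the LSA idea, as a sketch only: the four terms and six "documents" below are invented, and the counts are chosen so that pet words and finance words mostly occur in different documents. LSA is conventionally a truncated SVD of the uncentred term-document matrix, which is what this does.

```python
import numpy as np

terms = ["dog", "cat", "stock", "bond"]

# Rows are terms, columns are six made-up documents (raw term counts).
A = np.array([
    [1, 1, 2, 0, 0, 0],  # dog
    [1, 2, 1, 0, 0, 1],  # cat
    [0, 0, 0, 1, 2, 1],  # stock
    [0, 0, 0, 1, 1, 2],  # bond
], dtype=float)

# Truncated SVD: keep the top k singular directions.
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
term_vectors = U[:, :k] * S[:k]  # one k-dimensional vector per term

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vec = dict(zip(terms, term_vectors))
print("dog vs cat:  ", round(cosine(vec["dog"], vec["cat"]), 2))
print("dog vs stock:", round(cosine(vec["dog"], vec["stock"]), 2))
```

In the reduced space "dog" ends up much closer to "cat" than to "stock", which is the "pets" direction described above.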

> I've never seen it result in a more accurate model. I could see it potentially being useful in situations where you're forced to use a linear/logistic model

PCA/SVD is a linear transformation of the data, so on its own it shouldn't give you any performance increase on a linear model. However, it can be very helpful for transforming extremely high-dimensional, sparse vectors into lower-dimensional, dense representations. This can provide a lot of storage/performance benefits.

> NNs, etc. are all able to tease out pretty complicated relationships among features on their own.

PCA is essentially equivalent to an autoencoder with no non-linear layers that minimizes the MSE: the linear autoencoder learns the same subspace that the top principal components span. It is a very good first step towards understanding what your NN will eventually do. After all, NNs perform a non-linear transformation so that your final vector space is ultimately linearly separable.
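The PCA/autoencoder connection can be checked numerically. The sketch below trains a bias-free linear autoencoder with plain gradient descent on synthetic data; the data, learning rate, and step count are all assumptions picked for illustration. The learned weights are generally not the principal components themselves, but the decoder should span the same subspace and reach about the same reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 300, 5, 2

# Synthetic data with a clear spectrum, rotated so the principal axes
# are not aligned with the coordinate axes.
scales = np.array([3.0, 2.0, 0.5, 0.3, 0.1])
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = (rng.normal(size=(n, d)) * scales) @ rotation
Xc = X - X.mean(axis=0)

# PCA: project onto the top-k right singular vectors and reconstruct.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:k].T
pca_mse = np.mean(np.sum((Xc - Xc @ V_k @ V_k.T) ** 2, axis=1))

# Linear autoencoder: encoder (d x k), decoder (k x d), no non-linearity,
# trained by gradient descent on the mean squared reconstruction error.
W_enc = 0.1 * rng.normal(size=(d, k))
W_dec = 0.1 * rng.normal(size=(k, d))
lr = 0.005
for _ in range(5000):
    Z = Xc @ W_enc
    err = Z @ W_dec - Xc
    grad_dec = (2.0 / n) * Z.T @ err
    grad_enc = (2.0 / n) * Xc.T @ err @ W_dec.T
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec
ae_mse = np.mean(np.sum((Xc @ W_enc @ W_dec - Xc) ** 2, axis=1))

# Compare the two subspaces through their orthogonal projectors.
Q_ae, _ = np.linalg.qr(W_dec.T)
subspace_gap = np.linalg.norm(V_k @ V_k.T - Q_ae @ Q_ae.T)

print("PCA MSE:", round(float(pca_mse), 4))
print("AE  MSE:", round(float(ae_mse), 4))
print("projector difference:", round(float(subspace_gap), 4))
```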

Sure, everyone wants to get to the latent factors that really drive the outcome of interest, but I've never seen a situation in which principal components _really_ represent latent factors unless you squint hard at them and want to believe. As for gaining insight and explaining user behavior, I'd much rather just fit a decent model and share some SHAP plots for understanding how your features relate to the target and to each other.

If you like PCA and find it works in your particular domains, all the more power to you. I just don't find it practically useful for fitting better models and am generally suspicious of the insights drawn from that and other unsupervised techniques, especially given how much of the meaning of the results gets imparted by the observer who often has a particular story they'd like to tell.

I've used PCA with good results in the past. My problem essentially simplified down to trying to find nearest neighbours in high dimensional spaces. Distance metrics in high dimensional spaces don't behave nicely. Using PCA to reduce the number of dimensions to something more manageable made the problem much more tractable.
Plenty of examples for these in finance and economics (term structure, asset pricing factors).
By definition there are more accurate models; PCA is kind of like a general-purpose lossy compression algorithm. Any model you come up with can be superseded by a more accurate one until you have a perfect description of the phenomenon. But PCA is a well understood technique, can be computed very fast using optimized algorithms and GPUs, and pretty much anyone can understand it and apply it to a wide variety of problems. From a technical standpoint, for a given number of output dimensions it preserves the maximum amount of variance of any linear projection.

We use PCA quite a lot at my quant firm to do something similar to clustering in high dimensional spaces. A simple use case would be to arrange stocks so that stocks that move similarly to one another are grouped close together.

Another use case for PCA is breaking stocks down into constituent components, for example being able to express the price of a stock as a linear combination of factors: MSFT = 5% oil + 10% interest rates + 40% tech sector + ...

You can also do this for things like ETFs, where in principle an ETF is potentially made up of 100 stocks, but in practice only 10 of those stocks really determine the price, so if you're engaged in ETF market making you can hold a neutral portfolio by carrying the ETF long and a small handful of stocks short.
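A rough sketch of the "group stocks that move together" use case, on simulated returns rather than real market data. The tickers are hypothetical, and the return model (one market factor, one factor per sector, plus idiosyncratic noise) is an assumption made purely for illustration. PCA factors are statistical, so they only line up with named drivers like "oil" or "tech" when the data happens to have that structure.

```python
import numpy as np

rng = np.random.default_rng(0)
tickers = ["TECH_A", "TECH_B", "TECH_C", "OIL_A", "OIL_B", "OIL_C"]  # hypothetical
n_days = 1000

# Simulated daily returns: shared market move + sector move + stock noise.
market = rng.normal(size=(n_days, 1))
tech = rng.normal(size=(n_days, 1))
oil = rng.normal(size=(n_days, 1))
idio = 0.5 * rng.normal(size=(n_days, len(tickers)))
returns = market + np.hstack([tech, tech, tech, oil, oil, oil]) + idio

# PCA on centred returns; each stock gets a 2-D coordinate from its
# weights on the first two principal directions.
R = returns - returns.mean(axis=0)
_, S, Vt = np.linalg.svd(R, full_matrices=False)
coords = Vt[:2].T

for ticker, (c1, c2) in zip(tickers, coords):
    print(f"{ticker}: PC1={c1:+.2f}  PC2={c2:+.2f}")
```

With this setup PC1 behaves like the market (every stock has the same sign) and PC2 splits the two sectors, so stocks from the same sector sit next to each other in the 2-D plot.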

By definition, it's going to result in a less accurate model, unless you keep all of the dimensions or your data is very weird, right? And NNs are going to complicate your interpretability more?
When/if used properly, no. The idea behind PCA is to find a set of features with far fewer dimensions than the original data. The hope/intent with this sort of approach is that anything fitted beyond those features is just fitting noise.
For people who are curious, the GP is correct when it comes to fitting the training data. Recall, with enough parameters, we can get 100% on training. The parent’s comment is about testing/validation where we want to avoid overfitting so removing the least important parameters can be helpful.
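The train/test distinction is easy to see in a small simulation. This is a sketch under assumed conditions: 50 noisy features driven by 3 latent factors, only 60 training rows, and a target that depends on the latent factors. It compares ordinary least squares on all features with regression on the top 3 principal components.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 3
B = rng.normal(size=(k, d))      # how the latent factors show up in the features
w = np.array([1.0, -2.0, 0.5])   # how the latent factors drive the target

def sample(n):
    """Draw n rows of features and targets from the assumed model."""
    f = rng.normal(size=(n, k))
    X = f @ B + rng.normal(size=(n, d))
    y = f @ w + rng.normal(size=n)
    return X, y

X_train, y_train = sample(60)
X_test, y_test = sample(2000)

x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
Xc = X_train - x_mean

# Least squares on all 50 features.
beta_full, *_ = np.linalg.lstsq(Xc, y_train - y_mean, rcond=None)

# Least squares on the top-k principal components only.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:k].T
beta_pcs, *_ = np.linalg.lstsq(Xc @ W, y_train - y_mean, rcond=None)

def mse(pred, y):
    return float(np.mean((pred - y) ** 2))

train_full = mse(Xc @ beta_full + y_mean, y_train)
train_pcr = mse(Xc @ W @ beta_pcs + y_mean, y_train)
test_full = mse((X_test - x_mean) @ beta_full + y_mean, y_test)
test_pcr = mse((X_test - x_mean) @ W @ beta_pcs + y_mean, y_test)

print("train MSE  all features:", round(train_full, 3), " top PCs:", round(train_pcr, 3))
print("test  MSE  all features:", round(test_full, 3), " top PCs:", round(test_pcr, 3))
```

The full model always fits the training data at least as well, since the PC regression is a restricted version of it. On held-out data the ordering flips in this setup, because the extra coefficients were mostly fitting noise.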
Not if many columns in your data are driven by some common latent factors.
PCA is good enough for a lot of things. For example, it is used in genetics to measure relatedness between populations reasonably well. A perfect model doesn't really exist when the data you are able to realistically collect is only a subset of the population anyway, perhaps biased toward how it was collected.
i can think of a few places where it's useful:

if you know that your data comes from a stationary distribution, you can use it as a compression technique which reduces the computational demands on your model. sure, computing the initial svd or covariance matrix is expensive, but once you have it, the projection is just a matrix multiply and a vector subtraction. (with the reverse being the same)
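As a sketch of the "fit once, then it's just a subtraction and a matrix multiply" point: the helper names `fit_pca`, `compress`, and `decompress` are made up for this example, and the data is synthetic (3 latent dimensions embedded in 50, plus a little noise).

```python
import numpy as np

def fit_pca(X, k):
    """One-off, expensive step: mean vector and top-k directions (d x k)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def compress(x, mean, W):
    """Projection: a vector subtraction and a matrix multiply."""
    return (x - mean) @ W

def decompress(z, mean, W):
    """Reverse: a matrix multiply and a vector addition."""
    return z @ W.T + mean

rng = np.random.default_rng(0)
d, k = 50, 3
basis = rng.normal(size=(k, d))
offset = rng.normal(size=d)

def sample(n):
    """Draw from a fixed (stationary) distribution near a 3-D subspace."""
    return rng.normal(size=(n, k)) @ basis + offset + 0.01 * rng.normal(size=(n, d))

mean, W = fit_pca(sample(1000), k)

# Later data from the same distribution reuses the stored mean and W.
x_new = sample(1)[0]
z = compress(x_new, mean, W)
x_back = decompress(z, mean, W)
rel_error = np.linalg.norm(x_new - x_back) / np.linalg.norm(x_new)
print("compressed size:", z.shape, " relative error:", round(float(rel_error), 4))
```

This only stays accurate while new data comes from the distribution the projection was fitted on, which is the stationarity caveat above.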

if you have some high dimensional data and you just want to look at it, it's a pretty good start. not only does it give you a sense for whether higher dimensions are just noise (by looking at the eigenspectrum), it also makes low dimensional plots possible.
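A minimal sketch of reading the eigenspectrum, on synthetic data with 3 real directions hidden in 20 noisy dimensions. The 95% threshold is an arbitrary choice for the example, not a rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20

# Three strong directions (std 5, 3, 2) embedded in 20 dimensions, plus noise.
Q, _ = np.linalg.qr(rng.normal(size=(d, 3)))  # orthonormal columns
latent = rng.normal(size=(n, 3)) * np.array([5.0, 3.0, 2.0])
X = latent @ Q.T + 0.1 * rng.normal(size=(n, d))

# Eigenvalues of the covariance matrix, largest first.
cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]

ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

print("top eigenvalues:", np.round(eigvals[:5], 2))
print("components needed for 95% of variance:", n_keep)
```

A sharp drop after the first few eigenvalues, as here, suggests the remaining dimensions are mostly noise. A slowly decaying spectrum is a hint that a low dimensional plot will hide a lot.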

pca, cca and ica have been around for a very long time. i doubt "their time has passed."

but who knows, maybe i'm wrong.

It is still a nice tool for projecting things (at least to visualize) where you expect the data to be on a lower dimensional hyperplane. I do agree in most cases t-SNE or UMAP are better (esp if you don’t care about distances).