| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by thraway180306 2860 days ago
	PCA is dimensionally invalid, it destroys, not preserves structure and consists of arbitrary linear algebra operations. It is "less arbitrary" the way x86 assembly is "less arbitrary" wrt. C (actually it ties you to a certain mode of thinking).

2 comments

mturmon 2859 days ago

I don't think "arbitrary linear algebra operations" is a valid critique. If you understand PCA as "take the SVD of the data", then the operations seem arbitrary. But if you understand it as, "construct a low-rank approximation in the L2 sense to the data, or its covariance", then it's not.

Also, I don't think that the (very legitimate) "dimensional" critique of PCA applies here. The units on the coordinates of the representation are the same: the presence or absence of that prime factor.

To the original question: my suspicion is that PCA might pull out the even numbers (first PC) and the divisible by 3 numbers (second PC), because these two factors may explain the most variability in the underlying vector representation. If it did, that would be pretty intuitive, although not as interesting.

---

Edited to add: Suspicion turned out to be true. For the first 2000 integers, the top 6 PCs turned out to correspond to the first 6 primes (2, 3, 5, 7, 11, 13).

Plot at: https://imgur.com/a/qi2Sx5u?

  function [nums,pcs]=pca_prime(nMax,nPC)

  nums = zeros(nMax, nMax);
  for k = 2:nMax,
    nums(k,factor(k)) = 1; % vector representation of "k"
  end;
  % 2:end because don't care about 1 as a "prime"
  pcs = pca(nums(2:end,:), 'NumComponents', nPC);

  [nums,pcs]=pca_prime(2000,10); % "svd" would work too
  plot(pcs(:,1:6)); % first 6 PCs

link

murbard2 2859 days ago

If you think of the covariance matrix, entry i,j for i ≠ j will be

   floor(n / (p[i]*p[j])) / n - floor(n / p[i]) * floor(n/p[j]) / n^2

and the ith diagonal entry will be

   floor(n / p[i]) / n - ( floor(n / p[i]) / n )^2

for n large, you approximately get a diagonal matrix with diagonal entries / eigenvalues 1/p[i] - 1/p[i]^2.

link

mturmon 2859 days ago

Smart observation. Another way to say it is that, for distinct primes p1 and p2, the events “p1 divides n”, and “p2 divides n”, are approximately statistically independent. So you get a near-diagonal covariance with entries as you wrote.

link

jacobolus 2859 days ago

> If you understand PCA as "take the SVD of the data", then the operations seem arbitrary. But if you understand it as, "construct a low-rank approximation in the L2 sense to the data, or its covariance", then it's not.

Those are the same thing. Is your first case just a reader who has no idea what the SVD is?

link

mturmon 2859 days ago

Yes, they are both descriptions of the same thing. I'm trying to say that PCA does have a justification. It's not just an "arbitrary linear algebra operation", although the application of the SVD algorithm to perform PCA can be presented that way.

link

QML 2859 days ago

Are you saying the kth principal component is equal to the kth prime number?

link

mturmon 2859 days ago

Yes. See the plot: for example, PC #1 is essentially a 0/1 vector with all its weight (1 in this case) placed on the "2", which is the representation of the prime number 2 in the scheme used in the OP.

link

OscarCunningham 2859 days ago

> dimensionally invalid

What does this mean?

link

thraway180306 2859 days ago

Without (arbitrary) scaling normalization PCA gives different results with change of dimensional units, that is principal components depend on choice of units or scaling.

link

OscarCunningham 2859 days ago

Oh okay. I did know about that weakness of PCA. But it seems like in this case letting one prime factor correspond to one unit of distance is a nonarbitrary scaling, so I would expect PCA to give sensible results.

link