Hacker News new | ask | show | jobs
by thraway180306 2860 days ago
PCA is dimensionally invalid, it destroys, not preserves structure and consists of arbitrary linear algebra operations. It is "less arbitrary" the way x86 assembly is "less arbitrary" wrt. C (actually it ties you to a certain mode of thinking).
2 comments

I don't think "arbitrary linear algebra operations" is a valid critique. If you understand PCA as "take the SVD of the data", then the operations seem arbitrary. But if you understand it as, "construct a low-rank approximation in the L2 sense to the data, or its covariance", then it's not.

Also, I don't think that the (very legitimate) "dimensional" critique of PCA applies here. The units on the coordinates of the representation are the same: the presence or absence of that prime factor.

To the original question: my suspicion is that PCA might pull out the even numbers (first PC) and the divisible by 3 numbers (second PC), because these two factors may explain the most variability in the underlying vector representation. If it did, that would be pretty intuitive, although not as interesting.

---

Edited to add: Suspicion turned out to be true. For the first 2000 integers, the top 6 PCs turned out to correspond to the first 6 primes (2, 3, 5, 7, 11, 13).

Plot at: https://imgur.com/a/qi2Sx5u?

  function [nums,pcs]=pca_prime(nMax,nPC)

  nums = zeros(nMax, nMax);
  for k = 2:nMax,
    nums(k,factor(k)) = 1; % vector representation of "k"
  end;
  % 2:end because don't care about 1 as a "prime"
  pcs = pca(nums(2:end,:), 'NumComponents', nPC);
--

  [nums,pcs]=pca_prime(2000,10); % "svd" would work too
  plot(pcs(:,1:6)); % first 6 PCs
If you think of the covariance matrix, entry i,j for i ≠ j will be

   floor(n / (p[i]*p[j])) / n - floor(n / p[i]) * floor(n/p[j]) / n^2
and the ith diagonal entry will be

   floor(n / p[i]) / n - ( floor(n / p[i]) / n )^2
for n large, you approximately get a diagonal matrix with diagonal entries / eigenvalues 1/p[i] - 1/p[i]^2.
Smart observation. Another way to say it is that, for distinct primes p1 and p2, the events “p1 divides n”, and “p2 divides n”, are approximately statistically independent. So you get a near-diagonal covariance with entries as you wrote.
> If you understand PCA as "take the SVD of the data", then the operations seem arbitrary. But if you understand it as, "construct a low-rank approximation in the L2 sense to the data, or its covariance", then it's not.

Those are the same thing. Is your first case just a reader who has no idea what the SVD is?

Yes, they are both descriptions of the same thing. I'm trying to say that PCA does have a justification. It's not just an "arbitrary linear algebra operation", although the application of the SVD algorithm to perform PCA can be presented that way.
Are you saying the kth principal component is equal to the kth prime number?
Yes. See the plot: for example, PC #1 is essentially a 0/1 vector with all its weight (1 in this case) placed on the "2", which is the representation of the prime number 2 in the scheme used in the OP.
> dimensionally invalid

What does this mean?

Without (arbitrary) scaling normalization PCA gives different results with change of dimensional units, that is principal components depend on choice of units or scaling.
Oh okay. I did know about that weakness of PCA. But it seems like in this case letting one prime factor correspond to one unit of distance is a nonarbitrary scaling, so I would expect PCA to give sensible results.