| HN Mirror


by numinary1 3489 days ago

When I was working on a recommender for television shows, I ran SVD on a large User/Item matrix to create a low rank approximation, essentially reducing thousands of user features (TV show preferences) to user vectors representing twenty or thirty abstract "features". Then I looked at the actual item preferences of users who expressed each feature at the greatest and least magnitude. The features, in some cases, mapped to recognizable constructs. There were distinct masculine and feminine features, several obvious Hispanic / Latino elements, and strong liberal versus conservative indicators. Others were less explainable using common labels.
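A minimal sketch of that kind of reduction, assuming NumPy and a small synthetic user/item matrix. The data, the rank of 3, and the helper names `truncated_svd` and `extreme_users` are all illustrative assumptions, not the original system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 users x 50 shows, generated from 3 hidden
# "tastes" plus a little noise. A real matrix would be far larger and sparser.
n_users, n_items, n_tastes = 200, 50, 3
user_tastes = rng.normal(size=(n_users, n_tastes))
item_tastes = rng.normal(size=(n_tastes, n_items))
ratings = user_tastes @ item_tastes + 0.1 * rng.normal(size=(n_users, n_items))


def truncated_svd(matrix, rank):
    """Return the top `rank` left vectors, singular values and right vectors."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]


rank = 3
u_k, s_k, vt_k = truncated_svd(ratings, rank)
user_vectors = u_k * s_k        # one row per user in the reduced feature space
approx = user_vectors @ vt_k    # rank-k approximation of the original matrix


def extreme_users(vectors, feature, n=5):
    """Indices of the users scoring lowest and highest on one abstract feature."""
    order = np.argsort(vectors[:, feature])
    return order[:n], order[-n:][::-1]


lowest, highest = extreme_users(user_vectors, feature=0)
rel_error = np.linalg.norm(ratings - approx) / np.linalg.norm(ratings)
print("relative reconstruction error:", rel_error)
print("users with the strongest positive expression of feature 0:", highest)
```

Looking up what the users in `lowest` and `highest` actually watched is the step where a feature either gets a recognizable label or doesn't.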

It struck me at the time that the qualities that were expressed most strongly were the ones that ended up having names in our language. But there were others for which I would say to myself, there is something about this group (e.g. those with the greatest expressed value of F124) that I recognize, but can't quite put my finger on.

Of course, I was looking at people through a keyhole, their TV viewing preferences being the only information I had.

Also, I noticed that these "came into focus" most clearly at a certain level of compression (rank).

FWIW

4 comments

jmde 3488 days ago

I started reading the essay not knowing what to think, and it turned out to be more relevant to my work than I thought.

The issue the essay discusses, how to interpret components such as these, has been central in some areas of psychology and the behavioral sciences for some time.

One thought about your "coming into focus at a certain level of compression" comment: I've done some analyses of these vectors as applied to text samples, and one thing that struck me was how poorly some of them replicated across datasets that are ostensibly similar (but not identical). Others, in contrast, reappeared across multiple corpora. To the extent that some of these components represent "real" features, they should reappear consistently across different datasets where you'd expect them to. That is, they should be robust to changes in the idiosyncratic features of any one dataset.
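One rough way to check that kind of robustness, sketched here with NumPy on synthetic data: fit the SVD separately on two independent samples and compare the item-side components by absolute cosine similarity. The data generator, the `strengths` values, and the helper names are assumptions made for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: both samples share the same item loadings but have
# different users, mimicking two "ostensibly similar" datasets.
n_items, n_tastes = 50, 3
item_tastes = rng.normal(size=(n_tastes, n_items))
strengths = np.array([5.0, 3.0, 1.5])   # distinct strengths keep components separable


def sample_users(n_users):
    """Draw a fresh user/item matrix from the shared item loadings."""
    user_tastes = rng.normal(size=(n_users, n_tastes)) * strengths
    noise = 0.5 * rng.normal(size=(n_users, n_items))
    return user_tastes @ item_tastes + noise


def item_components(matrix, rank):
    """Top `rank` right singular vectors (unit-length rows)."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[:rank]


comps_a = item_components(sample_users(1000), rank=3)
comps_b = item_components(sample_users(1000), rank=3)

# Rows are unit vectors, so dot products are cosines. The sign of a singular
# vector is arbitrary, hence the absolute value.
similarity = np.abs(comps_a @ comps_b.T)
best_match = similarity.max(axis=1)
print(np.round(similarity, 2))
print("best match per component:", np.round(best_match, 2))
```

A component whose best match in the other sample stays close to 1 is a candidate "real" feature; one that has no good match is more likely an artifact of that particular dataset.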


yxhuvud 3489 days ago

Did you ever compare that focus with a graph over the singular values?


thearn4 3489 days ago

It's a good question. FWIW, I would expect a reasonably sharp "L"-shaped curve in the focus. The assumption there, I guess, is that this metric of 'focus' is well characterized by the low-frequency-type basis matrices given by the first few rows/columns of the SVD's U and V.
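For reference, a small NumPy sketch of what such an "L"-shaped singular value curve looks like numerically, on synthetic low-rank-plus-noise data. The data and the ratio-based elbow heuristic are assumptions chosen for illustration; they are not the 'focus' metric discussed above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical matrix: 3 strong latent factors plus weak noise.
n_users, n_items, true_rank = 200, 50, 3
signal = rng.normal(size=(n_users, true_rank)) @ rng.normal(size=(true_rank, n_items))
matrix = signal + 0.1 * rng.normal(size=(n_users, n_items))

# Singular values only, returned in descending order.
s = np.linalg.svd(matrix, compute_uv=False)

# Fraction of total squared "energy" captured by the first k components.
energy = np.cumsum(s**2) / np.sum(s**2)

# Crude elbow heuristic: the position of the largest drop between
# consecutive singular values, measured as a ratio.
ratios = s[:-1] / s[1:]
elbow_rank = int(np.argmax(ratios)) + 1

print("first six singular values:", np.round(s[:6], 2))
print("energy captured by first three:", round(float(energy[2]), 4))
print("elbow at rank:", elbow_rank)
```

Plotting `s` against its index gives the graph asked about above: a steep vertical drop followed by a long flat tail.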


numinary1 3488 days ago

Exactly what I saw. Your expectation is correct.


gallerdude 3489 days ago

Computers have us figured out in a way that we don't.


sweetdreamerit 3488 days ago

A question: is this much better / different than a principal component analysis (or a factor analysis)?


antognini 3488 days ago

It's a bit of an apples/oranges comparison to compare SVD to PCA. SVD is a numerical technique, whereas PCA is a method to analyze a dataset. You can use SVD to perform PCA (although there are other ways to perform PCA without explicitly doing an SVD). I'm guessing that the GP performed PCA using SVD. There's a good Stack Exchange answer to exactly this question here:

http://stats.stackexchange.com/questions/121162/is-there-any...


zo7 3488 days ago

One way to do PCA is to use SVD to find the matrix of eigenvectors (of the data's covariance matrix) onto which you project your data, so the two are closely related.
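A short NumPy sketch of that relationship on made-up data: PCA computed from the eigendecomposition of the covariance matrix and PCA computed from the SVD of the centered data give the same variances and the same axes, up to sign. The data and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 300 samples, 4 correlated features.
data = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
centered = data - data.mean(axis=0)   # PCA requires mean-centered columns
n = len(centered)

# Route 1: eigendecomposition of the sample covariance matrix.
cov = centered.T @ centered / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data, no covariance matrix formed.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
svd_variances = s**2 / (n - 1)

# Same explained variances; same principal axes up to an arbitrary sign.
same_variances = np.allclose(eigvals, svd_variances)
same_axes = np.allclose(np.abs(eigvecs.T @ vt.T), np.eye(4), atol=1e-6)

# Projecting onto the right singular vectors gives the principal component
# scores, which equal U scaled by the singular values.
scores = centered @ vt.T
same_scores = np.allclose(scores, u * s)

print(same_variances, same_axes, same_scores)
```

Note that the centering step is what separates the two: an SVD of the raw, uncentered matrix is still a valid low-rank factorization, but it is not PCA.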
