by numinary1
3489 days ago
When I was working on a recommender for television shows, I ran SVD on a large User/Item matrix to create a low-rank approximation, essentially reducing thousands of user features (TV show preferences) to user vectors representing twenty or thirty abstract "features". Then I looked at the actual item preferences of the users who expressed each feature at the greatest and least magnitude. The features, in some cases, mapped to recognizable constructs. There were distinct masculine and feminine features, several obvious Hispanic / Latino elements, and strong liberal versus conservative indicators. Others were harder to explain with common labels. It struck me at the time that the qualities expressed most strongly were the ones that ended up having names in our language. But there were others for which I would say to myself: there is something about this group (e.g., those with the greatest expressed value of F124) that I recognize but can't quite put my finger on. Of course, I was looking at people through a keyhole, their TV viewing preferences being the only information I had. Also, I noticed that these features "came into focus" most clearly at a certain level of compression (rank). FWIW
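For anyone who wants to try the procedure described above, here is a minimal sketch in NumPy. It is not the original recommender code: the toy ratings matrix, the rank of 3, and the helper names `low_rank_user_vectors` and `extreme_users` are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 users x 50 shows, built from 3 hidden "tastes" plus noise.
n_users, n_items, n_latent = 200, 50, 3
user_tastes = rng.normal(size=(n_users, n_latent))
show_loadings = rng.normal(size=(n_latent, n_items))
ratings = user_tastes @ show_loadings + 0.1 * rng.normal(size=(n_users, n_items))


def low_rank_user_vectors(matrix, rank):
    """Return (user_vectors, item_directions) from a rank-truncated SVD."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    user_vectors = U[:, :rank] * s[:rank]  # one row per user, `rank` abstract features
    item_directions = Vt[:rank]            # one row per feature, one column per show
    return user_vectors, item_directions


def extreme_users(user_vectors, feature, k=5):
    """Indices of the k users with the lowest and the highest value of one feature."""
    order = np.argsort(user_vectors[:, feature])
    return order[:k], order[-k:][::-1]


user_vectors, item_directions = low_rank_user_vectors(ratings, rank=3)
lowest, highest = extreme_users(user_vectors, feature=0)

# How much of the original matrix the compressed representation keeps.
approx = user_vectors @ item_directions
rel_error = np.linalg.norm(ratings - approx) / np.linalg.norm(ratings)
```

With real data, the interesting step is reading the actual show preferences of the `lowest` and `highest` users for each feature, and repeating that at several values of `rank` to see where the features are easiest to interpret.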
|
The issues discussed in the essay, chiefly how to interpret components such as these, have been central in some areas of psychology and the behavioral sciences for some time.
One thought about your "coming into focus at a certain level of compression" comment: I've done some analyses of these vectors as applied to text samples, and one thing that struck me was how poorly some of them replicated across datasets that ostensibly should be similar (but are not the same). Others, in contrast, reappeared across multiple corpora. To the extent that some of these components represent "real" features, they should reappear consistently across different datasets where you'd expect them to. That is, they should be robust to changes in the idiosyncratic features of a given dataset.
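One rough way to check that kind of robustness, sketched here under my own assumptions rather than as the analysis I actually ran: fit SVD separately on two datasets and, for each component from the first, find its closest counterpart in the second by absolute cosine similarity. The synthetic data, the component strengths, and the helper names below are hypothetical.

```python
import numpy as np


def top_item_directions(matrix, rank):
    """Rows are unit-length right singular vectors, one per component."""
    _, _, Vt = np.linalg.svd(matrix, full_matrices=False)
    return Vt[:rank]


def best_match_similarity(components_a, components_b):
    """For each component in A, the largest |cosine| with any component in B.

    The absolute value is used because the sign of a singular vector is arbitrary.
    """
    similarity = np.abs(components_a @ components_b.T)
    return similarity.max(axis=1)


rng = np.random.default_rng(1)
n_items = 50
# Three shared latent directions with clearly different strengths (assumed for illustration).
loadings = rng.normal(size=(3, n_items))
strengths = np.array([4.0, 2.0, 1.0])


def sample_dataset(n_rows):
    """Draw a dataset that shares the same latent directions but has its own noise."""
    scores = rng.normal(size=(n_rows, 3)) * strengths
    noise = 0.1 * rng.normal(size=(n_rows, n_items))
    return scores @ loadings + noise


dataset_a = sample_dataset(100)
dataset_b = sample_dataset(100)

# Ask for 5 components even though only 3 are "real"; the extra two are fit to noise.
matches = best_match_similarity(
    top_item_directions(dataset_a, rank=5),
    top_item_directions(dataset_b, rank=5),
)
```

In this toy setup the components tied to shared structure should find a close match in the other dataset, while the ones fit to noise should not. On real corpora the comparison is messier, since components with similar strength can rotate into one another, so comparing the spanned subspaces may be more informative than matching components one by one.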