
by tzs 1467 days ago

Even fairly simple systems might not have a clear way to answer. Consider a movie recommendation system.

One simple way you could build such a system is to come up with a list of things about movies that might affect whether or not someone would like them, where for each thing on the list we can assign each movie a number from -1 to 1 that says how much of that thing the movie has. Call this list of numbers the movie's vector.

Some examples of things we might pick are how much comedy is in the movie, how much romance is in the movie, presence of A-list stars, how musical it is, and things like that. We might also have items for specific stars or directors.

Then we could go through our movie collection and have someone figure out each movie's scores for all those things in our list.

Then, for each user, we could figure out a list that says how important each of those things is to that user, from -1 (I hate movies that have this!) to 1 (I love movies that have this!). Let's call this the user preferences vector. If we have a list of movies a given user has already watched and how they rated them on, say, a 0 to 5 scale, then it is some straightforward math to find the user preferences vector that does the best job of scoring the movies they have already seen in a way that agrees well with that user's ratings.
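One way to do that "straightforward math" is a least squares fit. The sketch below is just one possible choice, not the only one: the factor names, movie scores, and ratings are made up, `fit_preferences` is a hypothetical helper, and rescaling the 0 to 5 stars to -1 to 1 and adding a small ridge term are assumptions of mine. Note that the fitted values are not forced to stay within -1 to 1.

```python
import numpy as np

# Hypothetical hand-scored movie vectors, one row per movie.
# Columns: comedy, romance, A-list stars, musical (each from -1 to 1).
movie_vectors = np.array([
    [ 0.9,  0.1,  0.5, -0.8],
    [-0.7,  0.8,  0.2,  0.6],
    [ 0.3, -0.5,  0.9, -0.2],
    [-0.2,  0.9, -0.4,  0.7],
    [ 0.8, -0.6,  0.1, -0.9],
])

# Made-up 0 to 5 star ratings from one user for those five movies.
ratings = np.array([4.5, 1.0, 3.5, 0.5, 5.0])


def fit_preferences(movie_vectors, ratings, ridge=0.1):
    """Fit a preference vector whose dot products track the ratings.

    Stars are rescaled from 0..5 to -1..1 (an assumption) so that a
    neutral 2.5 star rating corresponds to a dot product of 0.
    """
    targets = (ratings - 2.5) / 2.5
    n_factors = movie_vectors.shape[1]
    # Ridge-regularized normal equations: (X^T X + ridge * I) p = X^T y
    gram = movie_vectors.T @ movie_vectors + ridge * np.eye(n_factors)
    return np.linalg.solve(gram, movie_vectors.T @ targets)


user_prefs = fit_preferences(movie_vectors, ratings)
print(user_prefs)
```

The ridge term keeps the fit stable when a user has rated only a few movies; setting it to 0 gives plain least squares.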

That user preferences vector can then be used to recommend new movies and should work pretty well if (1) we picked a good list of things to score movies on, and (2) when we manually assigned the scores we got it right.

To predict how well a user would like a given movie we just take their user preferences vector and compute the dot product of it with the movie's vector. The more positive that result, the more we think the user would like the movie.
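As a small illustration of the dot product step, here is a sketch with made-up numbers. The movie titles, the four factors, and the `predicted_liking` helper are all hypothetical.

```python
import numpy as np

# Factor order (hypothetical): comedy, romance, A-list stars, musical.
user_prefs = np.array([0.8, -0.6, 0.2, -0.9])

movies = {
    "Movie A": np.array([0.9, -0.3, 0.5, -0.7]),
    "Movie B": np.array([-0.5, 0.8, 0.1, 0.9]),
    "Movie C": np.array([0.2, 0.1, 0.9, 0.0]),
}


def predicted_liking(user_prefs, movie_vector):
    """Dot product of preferences and movie scores; higher means a better match."""
    return float(np.dot(user_prefs, movie_vector))


# Sort titles from best predicted match to worst.
ranked = sorted(
    movies,
    key=lambda title: predicted_liking(user_prefs, movies[title]),
    reverse=True,
)
print(ranked)
```

With these numbers Movie A comes out on top, because it is heavy on comedy (which this user likes) and light on musical numbers (which this user dislikes).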

With this system, it would be easy to tell someone why you recommended a movie. We could look at their preference vector and compare it to the movie and tell them things they really like that the movie has and things they really hate that the movie does not have.
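To make that concrete: each factor contributes (preference × movie score) to the dot product, so the largest positive contributions are the natural "reasons" to report. The sketch below assumes a negative movie score means the movie lacks that thing; the factor names, numbers, and the `explain` helper are hypothetical.

```python
import numpy as np

factors = ["comedy", "romance", "A-list stars", "musical numbers"]
user_prefs = np.array([0.8, -0.6, 0.2, -0.9])
movie_vector = np.array([0.9, -0.3, 0.5, -0.7])


def explain(user_prefs, movie_vector, factors, top=2):
    """Return human-readable reasons for the strongest positive contributions."""
    contributions = user_prefs * movie_vector  # per-factor share of the dot product
    strongest_first = np.argsort(contributions)[::-1][:top]
    reasons = []
    for i in strongest_first:
        if contributions[i] <= 0:
            continue  # this factor did not help the recommendation
        if user_prefs[i] > 0:
            reasons.append(f"you like {factors[i]} and this movie has it")
        else:
            reasons.append(f"you dislike {factors[i]} and this movie lacks it")
    return reasons


for reason in explain(user_prefs, movie_vector, factors):
    print(reason)
```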

But the system described above has a drawback. It is hard to figure out what factors to include in the movie classification. Should comedy for example be one item, or should it be broken down into several such as physical comedy, insult comedy, bodily function comedy, and so on?

Also, if you have a large collection of movies it is a lot of work to go through them all and score them on each factor. And if you later find out you need to add or remove factors you have to do it again.

It turns out that there is a way to sidestep both the "what should my factors include?" and "how do we get the factors scored?" problems.

What you do is just decide on how many factors you will have. So let's say we decide we are going to have 50 factors. We don't have to decide what they mean. We'll just call them F1, F2, F3, ... for now.

Initially we just assign each factor in each movie's vector a random value from -1 to 1.

We also do the same thing for the initial user preference vectors. Just assign each factor in the preference vector a random value.

Then we can do a loop, consisting of these two steps:

1. Using the known 0 to 5 star ratings from users of films they have seen, adjust their preference vectors so that ordering movies by the dot products of the movie vectors with the preferences vector matches the ordering by the user's star ratings.

2. Same thing, except instead of adjusting the preference vector to better work with all the movies a user has seen, adjust the movie vector of each movie to better work with the preference vectors of all the users who have rated that movie.

Keep looping until things aren't changing much. You then end up with a set of movie vectors for your movie catalog and preference vectors for your users that do a good job of ordering the movies each user has seen in a way that matches how that user rated them, and that likely do a good job predicting how well they will like new movies.
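Here is a rough sketch of that loop. Note one substitution: instead of adjusting vectors to match the ordering of the star ratings, it minimizes squared error against mean-centered ratings using alternating ridge-regularized least squares, which is a common stand-in that is easy to write down. The function names, the `ridge` and `tol` values, and the tiny ratings table are all assumptions for illustration; `nan` marks a movie the user has not rated.

```python
import numpy as np


def solve_side(fixed, ratings, ridge):
    """For each row of `ratings`, fit a vector against the `fixed` vectors
    of the columns that row actually has a rating for."""
    n_factors = fixed.shape[1]
    out = np.zeros((ratings.shape[0], n_factors))
    for i, row in enumerate(ratings):
        seen = ~np.isnan(row)
        if not seen.any():
            continue  # nothing to learn from; leave as zeros
        X = fixed[seen]
        gram = X.T @ X + ridge * np.eye(n_factors)
        out[i] = np.linalg.solve(gram, X.T @ row[seen])
    return out


def factorize(ratings, n_factors=50, ridge=0.1, max_iters=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_movies = ratings.shape
    mean = np.nanmean(ratings)
    centered = ratings - mean  # unrated entries stay nan
    # Random starting values from -1 to 1 for every factor.
    user_prefs = rng.uniform(-1, 1, (n_users, n_factors))
    movie_vecs = rng.uniform(-1, 1, (n_movies, n_factors))
    for _ in range(max_iters):
        before = user_prefs @ movie_vecs.T
        user_prefs = solve_side(movie_vecs, centered, ridge)    # step 1
        movie_vecs = solve_side(user_prefs, centered.T, ridge)  # step 2
        # Stop once the predictions "aren't changing much".
        if np.abs(user_prefs @ movie_vecs.T - before).max() < tol:
            break
    return user_prefs, movie_vecs, mean


nan = np.nan
ratings = np.array([  # rows are users, columns are movies
    [5.0, 4.0, nan, 1.0, 0.0],
    [4.0, 5.0, 1.0, nan, 2.0],
    [1.0, 0.0, 5.0, 4.0, nan],
    [nan, 1.0, 4.0, 5.0, 3.0],
])
user_prefs, movie_vecs, mean = factorize(ratings, n_factors=3)
predicted = mean + user_prefs @ movie_vecs.T  # includes the unrated cells
print(np.round(predicted, 1))
```

The entries of `predicted` that line up with `nan` in `ratings` are the guesses for movies a user has not seen yet, which is what you would rank to make recommendations.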

This was with an arbitrary decision to have 50 factors in our movie vectors. Since the process described above can be entirely automated and is pretty quick, we can experiment with different vector sizes.

We end up with a recommendation system that is very likely much better than the one we would get if we picked the items that went into the movie vector ourselves.

It is still a simple system, just like it was when we came up with the components ourselves and assigned them by watching the movies ourselves.

But notice that we now have no idea what the heck the factors in the movie vector mean. We can no longer tell someone we recommended a movie because they like comedy and romance and hate kids movies and this movie fits that well.

We could tell them they liked past movies that have a high score in components 12, 15, 23, and 102 and a low score in components 19, 77, 83, and 107, but they are probably not going to find that to be a useful answer.