by andreyk 3573 days ago
"To correct for this, we feed the age of the training example as a feature during training"

Does this mean something different from feeding the age of the video, relative to when the training example was recorded? Feeding in the age of the video seems like a fairly obvious idea, and it should train the network to favor newer videos. If it actually means how long ago the training example was recorded, that is rather strange, as I don't see why that would be needed on top of the video age. Neat graph, there.

I am often annoyed at how narrowly online recommendation systems focus on my very specific recent activity, rather than on the broader interests I display over months or years of using a product (looking at you, Amazon). It seems like it should be relatively easy to learn 'this guy likes little video essays about art and science and sometimes fun talk shows', and yet YouTube has been pretty bad at recommending such video-essay-style content to me. Perhaps this will improve it, although I wonder how much the recent-history features end up overwhelming the years-long data about what interests me broadly, not just what interested me yesterday.

As an aside, is it really "Deep Neural Networks for YouTube Recommendations" if you are using 5-ish layers of embedding, ReLU units, and output? A bit humorous, that.

1 comment

Your intuition is correct - there are other ways to capture the non-stationary nature of this particular problem. We thought the example age approach was neat because it is a general technique for removing a bias inherent to any machine learning system. Since examples always come from the past, you often have to be careful to prevent a system from being overly biased towards historical behavior. You don't need any additional metadata about items (what's the age of a search query?), and it's more resilient when predicting in regions the model has never seen, because you fix serving to the very end of the training window.
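
To make the idea concrete, here is a rough sketch in Python of how an example age feature could be built. It is only an illustration, not our production code: the function names, the toy timestamps and the feature layout are all made up for this comment. It assumes age is measured back from the end of the training window, so pinning serving to the end of the window means an age of zero.

```python
import numpy as np

def example_age(event_times, window_end):
    """Age of each logged example, measured back from the end of the training window.

    Hypothetical helper: both arguments use the same time unit (e.g. days).
    """
    return window_end - np.asarray(event_times, dtype=float)

def build_training_features(base_features, event_times, window_end):
    """Append example age as one extra column to the other input features."""
    age = example_age(event_times, window_end)
    return np.column_stack([np.asarray(base_features, dtype=float), age])

def build_serving_features(base_features):
    """At serving time the age column is fixed to zero (assumption: zero
    corresponds to the very end of the training window)."""
    base = np.atleast_2d(np.asarray(base_features, dtype=float))
    return np.hstack([base, np.zeros((base.shape[0], 1))])

# Toy usage: two examples logged at day 10 and day 25 of a 30-day window.
base = [[1.0, 2.0], [3.0, 4.0]]
train_x = build_training_features(base, event_times=[10.0, 25.0], window_end=30.0)
serve_x = build_serving_features(base)
```

Note that the feature depends only on when the example was logged, not on anything about the item itself, which is why it carries over to problems where items have no natural age.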

I tend to think the focus on recent behavior is an artifact of underfitting. Research into richer temporal modeling is needed and recurrent networks seem promising.

We debated internally whether to use the "deep" moniker - AlexNet was 8 layers, so maybe the threshold is 8? The depth seems sort of irrelevant, since stacking layers is trivial once the basic architecture is in place.