by dnc 2415 days ago
"The difference between this and word2vec is that the primitive atomic unit we are embedding isn’t a word, but a search query. Therefore this isn’t a language model, but a query model."

If a primitive embedding unit is a search query instead of a word, then I assume that a query vector model is trained on a limited dictionary of queries. I wonder if that implies that the trained vector model can encode only search queries that are already present in its dictionary? If not, I think it would be interesting to know more about how the closed dictionary problem was solved.


Correct, the model has a fixed vocabulary set at train time. However, we benefit from the fact that query frequencies in most search systems follow a Zipfian distribution: a small number of distinct queries accounts for most of the traffic. That means a fixed vocabulary can capture the vast majority of queries that are ever issued. Daily retraining helps us pick up new queries that fall outside the vocabulary.
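To make the Zipfian point concrete, here is a rough sketch of how much traffic a fixed vocabulary would cover. The function name, the exponent of 1.0, and the vocabulary sizes are all assumptions for illustration, not numbers from the post.

```python
def zipf_coverage(vocab_size, num_distinct_queries, exponent=1.0):
    """Fraction of query traffic covered by the top `vocab_size` queries,
    assuming the query at rank r has frequency proportional to 1 / r**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_distinct_queries + 1)]
    return sum(weights[:vocab_size]) / sum(weights)

# Hypothetical sizes: keep the 10,000 most frequent of 100,000 distinct queries.
coverage = zipf_coverage(vocab_size=10_000, num_distinct_queries=100_000)
print(f"top 10% of distinct queries cover {coverage:.1%} of traffic")
```

Under these assumed parameters the top tenth of distinct queries covers roughly four fifths of traffic; a real query log would need to be measured rather than assumed.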
How is your model fundamentally different from doc2vec though? The idea of swapping a vocab of words for a vocab of tokens, sentences, paragraphs, queries, articles, pages, etc., has been around and used in research and production for virtually as long as word2vec itself.

One of the most common “off the shelf” search solutions is to train an embedding for X (where X is whatever you want to search) using some implicit similarity or triplet loss on examples that have natural positive labels (such as proximity in the word2vec case) plus random negative sampling, and then to use the embedding in an exact or approximate nearest neighbor index.
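The lookup half of that recipe can be sketched as an exact nearest neighbor search over cosine similarity. The embeddings below are made-up toy vectors and `nearest` is a hypothetical helper; a production system would more likely use an ANN library than a full sort.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query_vec, index, k=2):
    """Return the k keys in `index` whose vectors are most similar to query_vec."""
    scored = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

# Toy embeddings, invented for illustration only.
embeddings = {
    "pizza": [0.9, 0.1, 0.0],
    "pepperoni pizza": [0.8, 0.2, 0.1],
    "sushi": [0.0, 0.9, 0.3],
    "ramen": [0.1, 0.8, 0.4],
}
neighbors = nearest(embeddings["pizza"], embeddings, k=2)
print(neighbors)  # ['pizza', 'pepperoni pizza']
```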

In fact, it’s even very “off the shelf” to use Siamese networks for multi-modal embeddings, for example simultaneously learning embeddings that put semantically similar queries and food images into the same ANN vector space.

Don’t get me wrong, I think the blog post is very cool, but no part of this is novel (if that’s what you were going for, I can’t tell). This exact end-to-end pipeline has been used in production at several different companies I’ve worked for: image search, collaborative filtering (customer & product embeddings), and various recommender systems where the unit of embedding is something like “pages”, “products”, “blog posts”, or whatever other product primitive fits the business.

I agree:

In the IR community you will see image2image search implemented by passing an image through a headless CNN and then doing ANN search on the embedding space. That can then lead to a discussion about cross-modal hashing, which you suggested. I think you also touched on LTR techniques with Siamese networks. For example, we can use a Siamese network to learn the difference between two embeddings.

In the RecSys community, for collaborative filtering you can take interaction data between users and items and decompose that matrix into two factors using SVD; those factors are called the user and item embeddings. You also see people in the RecSys community doing item2vec (what I did) for the item embeddings and then aggregating those to get the user representation (embedding).
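As a small illustration of the aggregation step, here is a sketch that builds a user vector from item vectors. Mean pooling is an assumption on my part (the comment above does not say which aggregation was used), and the item vectors are toy values.

```python
def mean_pool(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Toy item embeddings, invented for illustration only.
item_vecs = {"item_a": [1.0, 0.0], "item_b": [0.0, 1.0], "item_c": [1.0, 1.0]}

# Hypothetical interaction history for one user.
user_history = ["item_a", "item_c"]
user_vec = mean_pool([item_vecs[item] for item in user_history])
print(user_vec)  # [1.0, 0.5]
```

Other aggregations (weighted by recency, or by interaction count) would slot into the same place.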

Maybe you will grant me the novelty of this method in making the connection between queries and converted restaurants as a similarity metric and then applying it to the IR use case of search query expansion? Should the post be clearer that I am not claiming to have invented item2vec or ANN (approximate nearest neighbors)?

Your final paragraph is reasonable, and I do think the post would benefit from more clearly talking about this.

The “trick” is to find some naturally occurring property that indicates a pair of examples has the positive label, such as two users selecting a given product, two text queries leading to clicks on the same image, two different blog posts sharing a keyword, etc. You then combine this with an efficient way to sample acceptable negative pairs, i.e. pairs that do not express the trait that indicates the positive label.
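A minimal sketch of that pairing step, assuming a click log of (query, clicked target) tuples. The log format, the function names, and the uniform negative sampler are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations

def positive_pairs(click_log):
    """Pair up queries that led to a click on the same target."""
    by_target = defaultdict(set)
    for query, target in click_log:
        by_target[target].add(query)
    pairs = set()
    for queries in by_target.values():
        # Sort so each unordered pair is stored once, in a stable order.
        pairs.update(combinations(sorted(queries), 2))
    return pairs

def sample_negative(anchor, vocabulary, positives, rng):
    """Pick a random query that is not a known positive for the anchor."""
    candidates = [q for q in vocabulary
                  if q != anchor
                  and (anchor, q) not in positives
                  and (q, anchor) not in positives]
    return rng.choice(candidates) if candidates else None

# Toy click log, invented for illustration only.
click_log = [("pizza", "r1"), ("pepperoni", "r1"), ("sushi", "r2"), ("nigiri", "r2")]
positives = positive_pairs(click_log)
vocabulary = sorted({query for query, _ in click_log})
negative = sample_negative("pizza", vocabulary, positives, random.Random(0))
```

Uniform sampling is the simplest option; it can produce false negatives when two queries are related but never shared a click in the log.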

That + deciding how to approach the loss function (e.g. cosine similarity, euclidean distance, distance in a hash space, triplet loss, should it have a margin) is the whole trick of the problem.
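For instance, one of those choices, a triplet loss with a margin over Euclidean distance, might look like the sketch below. The margin value and the toy vectors are assumptions; swapping in another distance function is a one-argument change.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2, distance=euclidean):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

# Positive is close and negative is far: the constraint is already satisfied.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
# Positive is far and negative is close: the loss is positive.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.1, 0.0])
print(easy, hard)
```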

The black box that produces embeddings (DNN, doc2vec, sparse vector from tfidf-like features, whatever) and the nearest neighbor piece are standard plug and play.

Thanks, I think we are in agreement.