by visarga 1323 days ago
More concretely, semantic similarity means using a neural net to embed each text, then using cosine similarity or a dot product to compute a score between two entities.

import numpy as np

# neural_net: any model that maps a text to a fixed-length embedding vector
embed1 = neural_net(txt1)
embed2 = neural_net(txt2)
sim_score = np.dot(embed1, embed2)

If you're making a search engine, you precompute the embeddings for all the items in your database. When a user performs a search, you just need to embed the query and do the dot products, which are pretty fast for small indexes.
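A minimal sketch of that flow, assuming tiny hand-written vectors in place of real neural-net embeddings. The `index` contents and the `search` helper are made up for illustration. It uses plain Python so it runs without dependencies; with NumPy the scoring loop would collapse to a single matrix-vector product.

```python
import heapq

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Precomputed "embeddings" for the items in the database.
# Toy 3-d vectors; a real system would store the neural net's outputs here.
index = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def search(query_embed, k=2):
    """Exact top-k: score the query against every item, O(N) per query."""
    scored = ((dot(query_embed, vec), name) for name, vec in index.items())
    return heapq.nlargest(k, scored)

# The query embedding is also hand-written for the example.
results = search([1.0, 0.0, 0.0], k=2)
print(results)
```

Because every item is scored, the results are exact and complete, which is usually fine while the index is small.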

If you want to index millions or billions of entities, doing a dot product against every item is inefficient because the cost scales linearly with the size of the index. There is a trick (similar in spirit to binary search) called approximate nearest neighbour (ANN) search that finds the top-k most similar results in roughly O(log(N)) time, at the cost of occasionally missing a true neighbour. There are a few good libraries for that.
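To make the "look at only part of the index" idea concrete, here is a toy sketch of one ANN technique, random-hyperplane hashing (a form of LSH). It is not the graph-based HNSW method and does not by itself give an O(log(N)) bound; it only illustrates the trade: score a small bucket of candidates instead of the whole index, and accept that some true neighbours may be missed. All names and parameters (`N_PLANES`, `signature`, `ann_search`, the random data) are invented for the example.

```python
import math
import random
from collections import defaultdict

random.seed(0)  # fixed seed so the toy example is reproducible
DIM, N_PLANES = 8, 4

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_unit_vector(dim):
    """Random direction, scaled to unit length."""
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Random hyperplanes; the side of each plane a vector falls on gives one bit.
planes = [random_unit_vector(DIM) for _ in range(N_PLANES)]

def signature(vec):
    """Bucket key: one sign bit per hyperplane."""
    return tuple(dot(vec, p) >= 0 for p in planes)

# Build the index: group item ids by signature (random stand-in embeddings).
items = [random_unit_vector(DIM) for _ in range(1000)]
buckets = defaultdict(list)
for item_id, vec in enumerate(items):
    buckets[signature(vec)].append(item_id)

def ann_search(query, k=5):
    """Score only the items sharing the query's bucket (approximate)."""
    candidates = buckets.get(signature(query), [])
    ranked = sorted(candidates, key=lambda i: dot(query, items[i]), reverse=True)
    return ranked[:k]

hits = ann_search(items[42])
print(hits, "scored", len(buckets[signature(items[42])]), "of", len(items))
```

Production libraries use more careful structures and tuning knobs to control how often a true neighbour is missed, so this is only a picture of the principle.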

4 comments

Are there any semantic search implementations focused on small, local deploys?

E.g. I'm interested in local serverless setups (on desktop, mobile, etc.) that return quality search results near-instantly, but that are also complete and accurate. I.e. I ruled out ANN because, with smaller datasets, I wanted complete results.

hnswlib is written in C++ and has Python bindings (you should be able to write your own for other languages). Faiss and Annoy (by Spotify) should also provide similar functionality.

https://github.com/nmslib/hnswlib

For anybody interested in why this comment says "cosine similarity _or_ dot product": it's because the vectors in word embedding models are typically scaled to unit length.

If cos(theta) := A.B / (|A| * |B|)

And A and B are normalised, then the denominator is 1, and the RHS is equal to the dot product.
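A quick numeric check of this, using two arbitrary example vectors (the values are chosen only for illustration):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length |a|."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """cos(theta) = A.B / (|A| * |B|)."""
    return dot(a, b) / (norm(a) * norm(b))

def normalise(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

a, b = [3.0, 4.0], [1.0, 2.0]
a_hat, b_hat = normalise(a), normalise(b)

# For unit vectors the denominator is 1, so cosine and dot product agree.
print(cosine(a, b), dot(a_hat, b_hat))
# On the raw vectors the plain dot product is a different number.
print(dot(a, b))
```

If the embeddings are not normalised, the dot product also reflects vector length, so the two scores can rank results differently.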

Thanks for the clarification, I should have mentioned it.
Taking notes, thanks :)