by fzliu 1317 days ago
I'm surprised to see that ML-based semantic search is barely touched on in this article. There's a strong focus on entity matching, but an arguably more powerful way to conduct similarity search is to leverage embedding vectors from trained models.

A great upside to this approach is that it works for many types of unstructured data (images, video, molecular structures, geospatial data, etc.), not just text. The rise of multimodal models such as CLIP (https://openai.com/blog/clip) makes this even more relevant today. Combine it with a vector database such as Milvus (https://milvus.io) and you'll be able to do this at scale with minimal effort.
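
To make the idea concrete, here is a minimal sketch of embedding-based similarity search. It assumes the vectors have already been produced by some trained model; the 3-dimensional vectors, the item names, and the helpers `cosine_similarity` and `top_k` are all made up for illustration. Real embeddings have hundreds of dimensions, and a vector database would replace the brute-force loop with an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Brute-force search: score every stored vector, return the k best keys."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Hypothetical toy "embeddings"; in practice these come from a model such as CLIP.
index = {
    "cat photo": [0.9, 0.1, 0.0],
    "kitten photo": [0.8, 0.2, 0.1],
    "invoice scan": [0.0, 0.1, 0.9],
}

query = [1.0, 0.0, 0.0]  # stand-in for the embedding of a query item
results = top_k(query, index, k=2)
print(results)
```

The same loop works regardless of what the vectors represent, which is why the approach carries over from text to images and other data types.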

3 comments

Shameless plug - for folks who don't want to take on the work of model selection, on-demand scaling of model serving, and scaling the vector database for search set size and query throughput, we built a service that hides all this behind a simple API [1]. The example in [1] is for images, but here is a quick-start for text [2].

[1] https://www.nyckel.com/semantic-image-search
[2] https://www.nyckel.com/docs/text-search-quickstart

Maybe from ... "at scale" bit? ML approaches are relatively computationally expensive.
And opaque and hard to modify.
The latent spaces created by neural networks inherently de-dupe data.