| HN Mirror



	by fzliu 1317 days ago
	I'm surprised to see that ML-based semantic search is barely touched on in this article. There's a strong focus on entity matching, but an arguably more powerful way to conduct similarity search is to leverage embedding vectors from trained models. A great upside to this approach is that it works for many types of unstructured data (images, video, molecular structures, geospatial data, etc.), not just text. The rise of multimodal models such as CLIP (https://openai.com/blog/clip) makes this even more relevant today. Combine it with a vector database such as Milvus (https://milvus.io) and you'll be able to do this at scale with minimal effort.
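
	For anyone who wants to see the idea in miniature, here is a rough sketch of embedding-based similarity search. The 3-d vectors are made up by hand and only stand in for what a trained model (CLIP or similar) would produce. The names `cosine_similarity`, `top_k` and `INDEX` are hypothetical, and the brute-force scan is not how a vector database such as Milvus does it at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Brute-force search: score every stored vector, return the k best."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)
    return [(key, score) for score, key in scored[:k]]


# ASSUMPTION: hand-made toy "embeddings", not the output of any real model.
INDEX = {
    "cat photo": [0.9, 0.1, 0.0],
    "kitten video": [0.8, 0.2, 0.1],
    "street map": [0.0, 0.1, 0.9],
}

results = top_k([1.0, 0.0, 0.0], INDEX, k=2)
print([key for key, _ in results])
```

	The same lookup works whatever the items are, as long as one model maps them into the same vector space. A vector database would replace the linear scan with an approximate nearest-neighbor index.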

3 comments

saintarian 1317 days ago

Shameless plug - for folks who don't want to take on the work of model selection, on-demand scaling of model serving, and scaling the vector database for search set size and query throughput, we built a service that hides all this behind a simple API [1]. The example in [1] is for images, but here is a quick-start for text [2].

[1] https://www.nyckel.com/semantic-image-search
[2] https://www.nyckel.com/docs/text-search-quickstart


marginalia_nu 1316 days ago

Maybe from ... "at scale" bit? ML approaches are relatively computationally expensive.


HyperSane 1316 days ago

And opaque and hard to modify.


HyperSane 1316 days ago

The latent spaces created by neural networks inherently de-dupe data.
