by teddykoker 1535 days ago

I wonder how their advertised "vector database" works. kNN combined with embeddings from pre-trained deep learning models can be very useful for information retrieval (e.g. searching for duplicate/similar images or text).

In the past I have used a k-d tree [1] for this, which allows nearest-neighbor searches in the vector space in O(log n) time on average. It seems they are offering a k-d-tree-as-a-service.

[1] https://en.wikipedia.org/wiki/K-d_tree
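
For readers unfamiliar with the k-d tree lookup mentioned above, here is a minimal sketch in plain Python. The names `Node`, `build`, and `nearest` are made up for illustration, and this is not how any particular vector database works. It only shows the pruning idea that gives fast searches in low dimensions.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point
    axis: int                 # coordinate this node splits on
    left: Optional["Node"]    # points with smaller-or-equal value on that axis
    right: Optional["Node"]   # points with larger-or-equal value on that axis


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Recursively split the points on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        ordered[mid],
        axis,
        build(ordered[:mid], depth + 1),
        build(ordered[mid + 1:], depth + 1),
    )


def nearest(node: Optional[Node], query: Point,
            best: Optional[Tuple[float, Point]] = None) -> Optional[Tuple[float, Point]]:
    """Return (distance, point) for the exact nearest neighbor of `query`."""
    if node is None:
        return best
    dist = math.dist(node.point, query)
    if best is None or dist < best[0]:
        best = (dist, node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Only descend into the far side if the splitting plane is closer
    # than the best match found so far; otherwise that subtree is pruned.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (9, 2)))
```

The logarithmic behavior depends on the pruning check skipping most subtrees, which is what tends to happen when the data has only a few dimensions.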

2 comments

gk1 1535 days ago

Pinecone stores and searches through dense vector embeddings using a proprietary ANN index. It also has live index updates and metadata filtering, which you’d expect from any database but which are surprisingly hard to find or do with vector indexes.

As you said, common use cases include deduplication and image search, and especially semantic search (text).
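
To make "metadata filtering" concrete, here is a rough sketch of a filtered nearest-neighbor search in plain Python. It is a brute-force scan that filters first and ranks by cosine similarity second. It is not Pinecone's proprietary index, and `filtered_knn`, the `items` layout, and the sample data are all assumptions made up for this example.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Item = Tuple[str, Sequence[float], Dict[str, str]]  # (id, vector, metadata)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_knn(query: Sequence[float], items: List[Item],
                 predicate: Callable[[Dict[str, str]], bool], k: int = 3) -> List[str]:
    """Keep only items whose metadata passes `predicate`, then return the
    ids of the k most similar vectors (exact search, O(n) per query)."""
    scored = [
        (cosine_similarity(query, vector), item_id)
        for item_id, vector, metadata in items
        if predicate(metadata)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


items: List[Item] = [
    ("a", [1.0, 0.0], {"type": "image"}),
    ("b", [0.9, 0.1], {"type": "text"}),
    ("c", [0.0, 1.0], {"type": "image"}),
    ("d", [0.7, 0.7], {"type": "image"}),
]

# A "live update" in this toy version is just appending to the list.
items.append(("e", [0.2, 0.9], {"type": "text"}))

print(filtered_knn([1.0, 0.1], items, lambda m: m["type"] == "image", k=2))
```

The hard part with real ANN indexes is that the index structure is built over all vectors, so filtering before or after the approximate search can either hurt recall or waste work. The brute-force version above avoids that problem only by giving up the index entirely.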


turnersr 1535 days ago

Do you happen to know other implementations that allow for live updates and metadata filtering like Pinecone?


generall 1535 days ago

Check out https://github.com/qdrant/qdrant


redskyluan 1534 days ago

See https://milvus.io/


mrintellectual 1535 days ago

> kNN combined with embeddings from pre-trained deep learning models can be very useful for information retrieval

Indeed! We've been able to build simple reverse image search apps and other solutions using the power of embeddings from pre-trained ML models: https://gist.github.com/fzliu/c9380a7f9ba411adeff0b727cdba15....

One quick note: k-d trees are great for indexing low-dimensional data, but for high-dimensional embeddings they tend to be a poor indexing choice since you'll end up visiting more nodes in the tree than you'd like. I found [1] to be a great overview of different indexing types for high-dimensional vectors and the advantages of each.

[1] https://milvus.io/docs/index.md
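
One commonly cited reason for the high-dimensional slowdown described above is that distances "concentrate": the nearest and farthest points end up at almost the same distance from the query, so a tree can rarely rule out a subtree. The sketch below illustrates this with uniformly random points in plain Python. The `contrast` function and its parameters are hypothetical, and real embeddings are not uniformly distributed, so treat it as an intuition aid only.

```python
import math
import random


def contrast(dim: int, n_points: int = 500, seed: int = 0) -> float:
    """Ratio of nearest to farthest distance from a random query to
    n_points random points in the unit hypercube of dimension `dim`.
    A ratio close to 1.0 means all points look about equally far away."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)


for dim in (2, 16, 128, 512):
    print(f"dim={dim:4d}  nearest/farthest = {contrast(dim):.3f}")
```

As the ratio approaches 1, the splitting plane at each tree node is usually closer to the query than the current best match, so the search has to descend into both children and ends up visiting a large share of the nodes.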


teddykoker 1535 days ago

For image retrieval, have you tried using a model trained with contrastive learning (e.g. SimCLR)? This could produce better embeddings for retrieval since the model is trained to explicitly pull similar pairs together in the embedding space (SimCLR does this by maximizing the cosine similarity of each positive pair).

Thanks for the reference! Nice outline of various ANN approaches.
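
As a rough illustration of the contrastive objective mentioned above, here is a single-anchor sketch of an NT-Xent-style loss (the loss family SimCLR uses) in plain Python. Real training computes this over every pair in a batch with a deep learning framework. The function name `nt_xent_single`, the toy vectors, and the temperature value are assumptions for this example.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nt_xent_single(anchor: Sequence[float], positive: Sequence[float],
                   negatives: List[Sequence[float]], temperature: float = 0.5) -> float:
    """Cross-entropy of picking the positive out of {positive} + negatives,
    using temperature-scaled cosine similarity as the logit."""
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(math.exp(cosine_similarity(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

# A positive that points the same way as the anchor gives a lower loss
# than one that is orthogonal to it.
print(nt_xent_single(anchor, [0.9, 0.1], negatives))
print(nt_xent_single(anchor, [0.0, 1.0], negatives))
```

On unit-normalized embeddings, a higher cosine similarity is equivalent to a smaller Euclidean distance (the squared distance equals 2 minus twice the cosine), which is why embeddings trained this way tend to suit nearest-neighbor retrieval.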


mrintellectual 1535 days ago

I haven't tried SimCLR, but I did try face embedding models trained with contrastive and triplet loss. For applications where precision is the key metric, I do agree that these loss functions are much better overall.
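
For reference, the triplet loss mentioned above can be sketched in a few lines of plain Python. This version uses plain Euclidean distance and a margin of 0.2. Both are assumptions, and some implementations use squared distances instead. `triplet_loss` is a name made up for this example.

```python
import math
from typing import Sequence


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.2) -> float:
    """Zero once the negative is at least `margin` farther from the
    anchor than the positive is; otherwise grows linearly with the gap."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Well-separated triplet: the positive is close and the negative is far.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
# Violating triplet: the negative is closer to the anchor than the positive.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))
```

Because the loss only cares that same-identity pairs are closer than different-identity pairs by a margin, it directly optimizes the ordering that a nearest-neighbor lookup relies on.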

If discovery or recall is what you're after, a generic image classification model trained with binary cross-entropy might be better. For example, performing reverse image search on a photo of a German Shepherd should always return images of GSheps in the first N pages, but showing other dog breeds in later pages and possibly even cats after that would be a desirable feature for many search/retrieval solutions. An embedding model trained with contrastive loss might have this behavior to a certain extent, but a model based on BCE should be better.
