by teddykoker 1535 days ago
I wonder how their advertised "vector database" works. kNN combined with embeddings from pre-trained deep learning models can be very useful for information retrieval (e.g. searching for duplicate/similar images or text).

In the past I have used a k-d tree [1] for this, which allows O(log n) searches in the vector space. It seems they are offering a k-d-tree-as-a-service.

[1] https://en.wikipedia.org/wiki/K-d_tree
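For reference, here is a minimal sketch of the kNN-over-embeddings idea described above. It assumes the embeddings were already produced by some pre-trained model; the file names and 3-d vectors are made up for illustration, and the search is exact brute force (O(n) per query) rather than a k-d tree or any particular product's index.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query, embeddings, k=2):
    """Exact kNN: score every stored vector and keep the k most similar keys."""
    scored = ((cosine_similarity(query, vec), key) for key, vec in embeddings.items())
    return [key for _, key in heapq.nlargest(k, scored)]

# Hypothetical embeddings; real ones would have hundreds of dimensions.
embeddings = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "car_1.jpg": [0.0, 0.1, 0.9],
}

neighbors = knn([1.0, 0.0, 0.0], embeddings, k=2)
print(neighbors)
```

Brute force like this is fine for a few thousand vectors; an index only starts to matter once scanning everything per query gets too slow.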

2 comments

Pinecone stores and searches through dense vector embeddings using a proprietary ANN index. It also has live index updates and metadata filtering, which you’d expect from any database but are surprisingly hard to find or do with vector indexes.

As you said, common use cases include deduplication and image search, and especially semantic search (text).

Do you happen to know other implementations that allow for live updates and metadata filtering like Pinecone?
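To make "live updates" and "metadata filtering" concrete, here is a toy sketch of the behavior. It is not Pinecone's API or index: `ToyVectorIndex`, `upsert`, `delete`, and the `where` parameter are hypothetical names chosen for illustration, and the search is exact brute force, which is exactly the case where both features are easy.

```python
import math

class ToyVectorIndex:
    """Brute-force stand-in for illustration only; not an ANN index."""

    def __init__(self):
        self._items = {}  # item id -> (vector, metadata dict)

    def upsert(self, item_id, vector, metadata=None):
        # Live update: the new or changed vector is searchable immediately.
        self._items[item_id] = (list(vector), dict(metadata or {}))

    def delete(self, item_id):
        self._items.pop(item_id, None)

    def query(self, vector, k=3, where=None):
        """Return ids of up to k nearest items (Euclidean) whose metadata matches `where`."""
        where = where or {}
        candidates = []
        for item_id, (vec, meta) in self._items.items():
            # Filter first, then rank only the items that passed the filter.
            if all(meta.get(key) == value for key, value in where.items()):
                candidates.append((math.dist(vector, vec), item_id))
        return [item_id for _, item_id in sorted(candidates)[:k]]

index = ToyVectorIndex()
index.upsert("a", [0.0, 0.0], {"type": "image"})
index.upsert("b", [0.1, 0.0], {"type": "text"})
index.upsert("c", [0.5, 0.5], {"type": "image"})

images_only = index.query([0.0, 0.0], k=2, where={"type": "image"})

# Move "c" and remove "a"; the very next query reflects both changes.
index.upsert("c", [0.0, 0.05], {"type": "text"})
index.delete("a")
after_update = index.query([0.0, 0.0], k=2)
```

The difficulty with real ANN indexes is that their structures are typically built ahead of time, so inserting, deleting, or filtering cheaply is much less straightforward than in this linear scan.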

> kNN combined with embeddings from pre-trained deep learning models can be very useful for information retrieval

Indeed! We've been able to build simple reverse image search apps and other solutions using the power of embeddings from pre-trained ML models: https://gist.github.com/fzliu/c9380a7f9ba411adeff0b727cdba15....

One quick note: k-d trees are great for indexing low-dimensional data, but for high-dimensional embeddings they tend to be a poor indexing choice since you'll end up visiting more nodes in the tree than you'd like. I found [1] to be a great overview of different indexing types for high-dimensional vectors and the advantages of each.

[1] https://milvus.io/docs/index.md
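The point about k-d trees visiting too many nodes in high dimensions can be checked with a small experiment. This is a minimal sketch on uniformly random points, which is an assumption (real embeddings have more structure), using a plain textbook k-d tree rather than any library's implementation. The sizes, dimensions, and seed are arbitrary choices.

```python
import math
import random

def build(points, depth=0):
    """Build a k-d tree as nested tuples (point, left, right), cycling the split axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], build(points[:mid], depth + 1), build(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None, stats=None):
    """Exact nearest neighbor as (distance, point); stats['visited'] counts nodes touched."""
    if node is None:
        return best
    point, left, right = node
    stats["visited"] += 1
    d = math.dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best, stats)
    # The far subtree can be skipped only if the splitting plane is
    # farther away than the best match found so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, depth + 1, best, stats)
    return best

def average_visits(dim, n_points=2000, n_queries=20, seed=0):
    """Average number of tree nodes visited per query for random points in `dim` dimensions."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(dim)) for _ in range(n_points)]
    tree = build(points)
    total = 0
    for _ in range(n_queries):
        query = tuple(rng.random() for _ in range(dim))
        stats = {"visited": 0}
        nearest(tree, query, stats=stats)
        total += stats["visited"]
    return total / n_queries

low = average_visits(dim=2)
high = average_visits(dim=64)
print(f"2-d:  {low:.0f} of 2000 nodes visited on average")
print(f"64-d: {high:.0f} of 2000 nodes visited on average")
```

In 2-d the pruning test skips most of the tree. In 64-d the nearest neighbor is usually farther away than any single-axis gap to a splitting plane, so the test rarely prunes and the search degrades toward a full scan.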

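For image retrieval, have you tried using a model trained with contrastive learning (e.g. SimCLR)? This could produce better embeddings for retrieval since the model is trained to explicitly pull similar pairs together in embedding space. (SimCLR's loss uses cosine similarity; on L2-normalized embeddings, maximizing it is equivalent to minimizing Euclidean distance.)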

Thanks for the reference! Nice outline of various ANN approaches.

I haven't tried SimCLR, but I did try face embedding models trained with contrastive and triplet loss. For applications where precision is the key metric, I do agree that these loss functions are much better overall.

If discovery or recall is what you're after, a generic image classification model trained with binary cross-entropy (BCE) might be better. For example, performing reverse image search on a photo of a German Shepherd should always return images of German Shepherds in the first N pages, but showing other dog breeds in later pages and possibly even cats after that would be a desirable feature for many search/retrieval solutions. An embedding model trained with contrastive loss might have this behavior to a certain extent, but a model based on BCE should be better.