by andrewmatte 2470 days ago
Nice article. I am interested to see this stuff blended with the GPU/ML-powered databases rather than the TF-IDF of decades ago, as well as TF-IDF works.

Someone needs to make a DB with first class support for dense feature vectors (embeddings) and approximate nearest neighbor search.

These two features would allow you to do visual search, semantic text search, recommendations, learning to rank, etc.
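
For anyone who hasn't seen it spelled out, here is a rough sketch of the exact, brute-force version of nearest-neighbor search over embeddings, i.e. the baseline that an approximate index tries to beat. The item names and 3-dimensional vectors are made up for illustration; real embeddings would have hundreds of dimensions and come from a model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, items, k=2):
    """Return the ids of the k items most similar to query (exact linear scan)."""
    scored = [(cosine_similarity(query, vec), item_id)
              for item_id, vec in items.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Hypothetical toy embeddings, invented for this example.
catalog = {
    "red_shoe": [0.9, 0.1, 0.0],
    "crimson_boot": [0.8, 0.2, 0.1],
    "blue_hat": [0.0, 0.1, 0.9],
}

print(nearest_neighbors([1.0, 0.0, 0.0], catalog, k=2))
```

The scan touches every stored vector on every query, which is why people reach for approximate indexes once the collection gets large.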

I'd love to have something like that. As far as I'm aware, one big limiting factor is that there aren't currently any great ways to do an index for approximate nearest neighbor search that doesn't require you to keep the whole index in memory. A disk-friendly indexing method would make it just a PostgreSQL plugin away.
There are no good exact indexing structures, but there are a lot of very high-performance approximate NN structures. Facebook has an open source implementation of some of these in a project called faiss [0], which does a relatively good job of this.

[0] - https://github.com/facebookresearch/faiss
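
To give a feel for what an approximate NN structure looks like, here is a minimal sketch of one classic idea, random-hyperplane locality-sensitive hashing. To be clear, this is not faiss's API or necessarily one of the methods it implements; the class name `HyperplaneLSH` and its parameters are invented for this example.

```python
import random

class HyperplaneLSH:
    """Toy locality-sensitive hash index built from random hyperplanes."""

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)  # seeded so the planes are reproducible
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_planes)]
        self.buckets = {}

    def _signature(self, vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0.0
            for plane in self.planes
        )

    def add(self, item_id, vec):
        self.buckets.setdefault(self._signature(vec), []).append((item_id, vec))

    def query(self, vec, k=1):
        # Only vectors in the query's bucket are scored. A true neighbor that
        # hashed to a different bucket is missed: that is the approximation.
        candidates = self.buckets.get(self._signature(vec), [])
        ranked = sorted(
            candidates,
            key=lambda c: sum(a * b for a, b in zip(vec, c[1])),
            reverse=True,
        )
        return [item_id for item_id, _ in ranked[:k]]

index = HyperplaneLSH(dim=3)
index.add("a", [1.0, 0.0, 0.0])
index.add("b", [-1.0, 0.0, 0.0])
print(index.query([2.0, 0.0, 0.0]))
```

Vectors pointing in similar directions tend to share a signature, so a query only scores one bucket instead of the whole collection. Real libraries use much more refined structures and several tables to recover the neighbors a single hash would miss.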

At Frame.ai, we are using both PostgreSQL and faiss (and other tools) in our stack to do several different kinds of inference tasks on semantic representations of text to help companies understand and act on customer chats, emails, and phone call transcripts.

We've frequently had the same dream of adding more native support for nearest-neighbor type queries, since that is the workhorse of so many useful techniques in the modern NLP stack.

Right now, we have lots of dense vectors stored in massive TOAST tables in PG. It's faster to fetch them than to recompute them, especially since there are a number of preprocessing steps that limit what we pay attention to.
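
As a sketch of what storing a dense vector as a binary value might look like: the comment above doesn't say how the vectors are encoded, so the little-endian float32 layout and the helper names `encode_vector` and `decode_vector` here are purely assumptions for illustration. A blob like this is the kind of value that could go in a binary column and be fetched back instead of recomputed.

```python
import struct

def encode_vector(vec):
    """Pack a list of floats into little-endian float32 bytes (assumed layout)."""
    return struct.pack(f"<{len(vec)}f", *vec)

def decode_vector(blob):
    """Inverse of encode_vector: 4 bytes per float32 component."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))

blob = encode_vector([0.5, -1.25, 3.0])
print(len(blob), decode_vector(blob))
```

Note that float32 halves the storage of Python's native doubles at the cost of precision, so values that aren't exactly representable will come back slightly rounded.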

The discussion here about full text search versus semantic search is interesting. In our experience, both are highly relevant. Sometimes it's most useful for our customers to segment their conversation data by exact text matches, and other times semantic clustering is most effective. I think there's plenty of reason to offer both kinds of capabilities.

Elasticsearch now has that in versions 7.3 and later.

https://www.elastic.co/guide/en/elasticsearch/reference/curr...

The vectors are only used for scoring, not matching, but they are working on an ANN model for that.
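
To illustrate the scoring-versus-matching distinction, here is a small pure-Python sketch. It is not Elasticsearch's API; the `search` function, the documents, and the 2-dimensional vectors are all hypothetical. Text terms decide which documents match, and the vector only decides their order.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query_terms, query_vec, docs, k=10):
    """Match on text first, then order the matches by vector similarity."""
    # Matching: only documents containing every query term are candidates.
    matched = [
        d for d in docs
        if all(term in d["text"].lower().split() for term in query_terms)
    ]
    # Scoring: the vector reorders the matched set but never adds to it.
    matched.sort(key=lambda d: dot(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in matched[:k]]

# Hypothetical documents with made-up embeddings.
docs = [
    {"id": 1, "text": "refund for damaged shoes", "vector": [0.9, 0.1]},
    {"id": 2, "text": "refund policy question", "vector": [0.2, 0.8]},
    {"id": 3, "text": "money back on broken boots", "vector": [0.95, 0.05]},
]

print(search(["refund"], [1.0, 0.0], docs))
```

Document 3 has the vector closest to the query but never appears in the results because it doesn't contain the word "refund". Closing that gap is what matching by ANN would be for.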

Sphinx Search has an engine plugin for MySQL