by tobrien6 1727 days ago
Amazon's Opensearch (a fork of Elasticsearch) natively supports vector-based approximate KNN (using https://github.com/nmslib/nmslib/), which is integrated with Opensearch's native filtering functionality. Elasticsearch has similar functionality, but I don't know if their KNN code scales quite as well.

Opensearch only supports "pre-filtering" or "post-filtering," which leads to either high latency or incomplete results, as explained in the article.

This is why single-stage filtering was the most-requested feature for us.

From the Opensearch docs:

> You should not use approximate k-NN if you want to apply a filter on the index before the k-NN search, which greatly reduces the number of vectors to be searched.

> Because the graphs are constructed during indexing, it is not possible to apply a filter on an index and then use this search method. All filters are applied on the results produced by the approximate nearest neighbor search.

> If you use the knn query alongside filters or other clauses (e.g. bool, must, match), you might receive fewer than k results.

(https://opensearch.org/docs/search-plugins/knn/approximate-k...)

I know Elasticsearch is working on introducing vector search but it is not yet available. I don't know how they will support filtering.

The approximate kNN is quite nice for many use cases, and scales to billions of documents. However, you're correct that filtering happens on the results. This is only an issue in use cases where the filter is very narrow, because you can often just request a much higher k than the number of results you really need without much slowdown.
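To make the "request a higher k" workaround concrete, here is a rough Python sketch. It is not Opensearch code: a brute-force search stands in for the approximate index, and the names `post_filtered_search`, `overfetch`, and `keep` are made up for illustration.

```python
import math
import random

def top_k(query, vectors, k):
    """Indices of the k vectors closest to query (Euclidean distance).
    Brute force stands in for the approximate index in this sketch."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]

def post_filtered_search(query, vectors, keep, k, overfetch=1):
    """Fetch k * overfetch neighbours, then drop the ones that fail the filter."""
    candidates = top_k(query, vectors, k * overfetch)
    return [i for i in candidates if keep(i)][:k]

random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(1000)]
query = [0.5] * 8

def keep(i):
    # Hypothetical filter that matches 10% of the documents.
    return i % 10 == 0

plain = post_filtered_search(query, vectors, keep, k=10)
padded = post_filtered_search(query, vectors, keep, k=10, overfetch=50)
print(len(plain), len(padded))
```

With no over-fetching, the filter is applied to only 10 candidates, so most of them are usually discarded. Padding k gives the filter more candidates to work with, at the cost of a larger search.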

If the filtering is very narrow then, as you commented, they also provide functionality to perform pre-filtering and then exact kNN on the results. This is of course higher latency, but still quite acceptable for many use cases (this is how I use it).
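The pre-filter-then-exact-kNN approach looks roughly like this in plain Python. Again this is only an illustration of the idea, not the Opensearch implementation, and `exact_knn_after_filter` and `allowed` are hypothetical names.

```python
import math

def exact_knn_after_filter(query, vectors, keep, k):
    """Apply the filter first, then score every surviving vector exactly."""
    survivors = [i for i in range(len(vectors)) if keep(i)]
    survivors.sort(key=lambda i: math.dist(query, vectors[i]))
    return survivors[:k]

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.1, 0.1]]
allowed = {1, 2, 3}  # hypothetical: documents that pass a narrow filter

result = exact_knn_after_filter([0.0, 0.0], vectors, lambda i: i in allowed, k=2)
print(result)  # [1, 2]
```

The cost grows with the number of documents that survive the filter, which is why this works best when the filter is narrow.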

I believe there are use cases that Pinecone addresses better than Opensearch, but I want to let people know that there is a free, open-source solution which _may_ also work for their use case.

Elasticsearch does currently support vector search through script score on dense vector fields. However, I suspect they are still working on improving it, and I prefer the Opensearch implementation for the time being: https://www.elastic.co/guide/en/elasticsearch/reference/curr...
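For reference, a script score request body can be built along these lines. This is a sketch from memory of the `script_score` query with the `cosineSimilarity` vector function; the exact script syntax has changed between Elasticsearch versions, so check it against the docs for yours. The field names `my_vector` and `status` are placeholders, and `script_score_body` is a hypothetical helper.

```python
import json

def script_score_body(query_vector, vector_field, filter_clause, size=10):
    """Build a request body that filters first, then scores the matching
    documents exactly by cosine similarity against query_vector."""
    return {
        "size": size,
        "query": {
            "script_score": {
                # Only documents matching this inner query get scored.
                "query": {"bool": {"filter": filter_clause}},
                "script": {
                    # + 1.0 keeps the score non-negative.
                    "source": f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }

body = script_score_body(
    query_vector=[0.1, 0.2, 0.3],
    vector_field="my_vector",                        # placeholder field name
    filter_clause={"term": {"status": "published"}},  # placeholder filter
)
print(json.dumps(body, indent=2))
```

Because the filter is the inner query, this is effectively pre-filtering followed by exact scoring, so latency depends on how many documents match the filter.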