| HN Mirror



	by tomd 1224 days ago
	You might be interested in https://datasette.io/plugins/datasette-faiss, which I'm using alongside openai-to-sqlite for similarity search of embeddings, following @simonw's excellent instructions at https://simonwillison.net/2023/Jan/13/semantic-search-answer...
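	For anyone wondering what similarity search over embeddings stored in SQLite looks like at its simplest, here is a rough sketch. It is plain brute-force cosine similarity using only the Python standard library, not datasette-faiss or FAISS itself. It assumes each embedding is stored as a BLOB of packed little-endian float32 values; the table name, the helper names, and the three-dimensional toy vectors are all made up for illustration.

```python
import math
import sqlite3
import struct


def encode(vector):
    """Pack a sequence of floats into little-endian float32 bytes."""
    return struct.pack("<" + "f" * len(vector), *vector)


def decode(blob):
    """Unpack little-endian float32 bytes back into a tuple of floats."""
    return struct.unpack("<" + "f" * (len(blob) // 4), blob)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical schema: one row per document, embedding stored as a BLOB.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE embeddings (id TEXT PRIMARY KEY, embedding BLOB)")

# Toy 3-dimensional vectors standing in for real embeddings.
toy = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "car": [0.0, 0.0, 1.0],
}
db.executemany(
    "INSERT INTO embeddings VALUES (?, ?)",
    [(doc_id, encode(vector)) for doc_id, vector in toy.items()],
)


def most_similar(db, query_vector, limit=2):
    """Score every stored row against the query and return the closest ids."""
    scored = [
        (cosine_similarity(query_vector, decode(blob)), doc_id)
        for doc_id, blob in db.execute("SELECT id, embedding FROM embeddings")
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:limit]]


print(most_similar(db, [1.0, 0.0, 0.0]))
```

	This scans every row on each query, which is the cost an index such as FAISS is meant to avoid on larger datasets.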


uh_uh 1224 days ago

Thanks, but the index being in-memory makes it unsuitable for large data sets :/

simonw 1224 days ago

There is a way of running disk-backed FAISS indexes that don't all fit in memory, but I've not quite figured out how to do that yet: https://github.com/facebookresearch/faiss/issues/2675

jamesblonde 1224 days ago

The OpenSearch k-NN plugin supports FAISS, and it's disk-based:

https://opensearch.org/docs/latest/search-plugins/knn/index/

uh_uh 1223 days ago

OpenSearch looks like the best option so far: it meets all my requirements!

iandanforth 1224 days ago

Can you say more? Usually projects that gravitate to SQLite are not those that require massive scale, and a FAISS index of a few GB covers a lot of documents.

uh_uh 1224 days ago

My dataset is going to be around 10M documents. With OpenAI embeddings, that will be around 62GB. AFAIK SQLite should be able to handle that size, but I haven't tried.

This is not going to be my primary DB. I would update this maybe once a day and the update doesn't have to be super fast.
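A rough check of the size estimate above, as a sketch. The thread does not state the embedding dimensionality or storage type, so this assumes 1536 dimensions per vector (the output size of OpenAI's text-embedding-ada-002) stored as raw 4-byte float32 values, with no index or database overhead counted.

```python
# Back-of-the-envelope size of raw embeddings for the dataset described above.
# Assumptions (not stated in the thread): 1536 dimensions per vector and
# 4 bytes per value (float32). Index and database overhead are ignored.
NUM_DOCS = 10_000_000
DIMENSIONS = 1536
BYTES_PER_FLOAT32 = 4


def raw_embedding_bytes(num_docs, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Return the number of bytes needed to store the raw vectors."""
    return num_docs * dimensions * bytes_per_value


total = raw_embedding_bytes(NUM_DOCS, DIMENSIONS)
print(f"{total / 1e9:.2f} GB")     # decimal gigabytes
print(f"{total / 2**30:.2f} GiB")  # binary gibibytes
```

Under those assumptions the raw vectors come to about 61 GB, which is in line with the figure quoted above.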

sharemywin 1224 days ago

You might check out some vector databases:

https://milvus.io/

and

pinecone.io

There are others too.
