Hacker News new | ask | show | jobs
by mathis-l 1075 days ago
You might want to give Haystack a try (disclaimer: I work at deepset, the company behind Haystack).

Haystack allows you to pre-process your documents into smaller chunks, calculate embeddings and index them into a document store. You can wrap all of that in a modular pipeline if you want.

Next, you can query your documents using a retrieval pipeline.

Regarding document store selection: Replacing your document store is easy, so I would start with the most simple one, probably an InMemoryDocumentStore. When you want to move from experimentation to production, you‘ll want to tailor your selection to your use case. Here‘s a few things that I‘ve observed.

You don’t want to manage anything and are fine with SaaS -> Pinecone

You have a very large dataset (500M+ vectors) and you want something that you can run locally -> maybe Qdrant

You have meta data that you want to incorporate into your retrieval or you want to do hybrid search -> Opensearch/Elasticsearch

Regarding model selection:

We‘ve seen https://huggingface.co/sentence-transformers/multi-qa-distil... work well for a good semantic search baseline with fast indexing times. If you feel like the performance is lacking, you could look at the E5 models. What also works fairly well for us is a multi-step retrieval process where we retrieve ~100 documents with BM25 first and then use a cross-encoder to rank these by semantic relevance. Very fast indexing times are a benefit and you also don’t need a beefy vector db to store your documents. Latency at query time will be slightly higher though and you might need a GPU machine to run your query pipeline.

Retrieval in Haystack: https://docs.haystack.deepset.ai/docs/retriever

Cross-Encoder approach: https://docs.haystack.deepset.ai/docs/ranker

Blog Post with an end-to-end retrieval example: https://haystack.deepset.ai/blog/how-to-build-a-semantic-sea...