by janalsncm 479 days ago
The demo looked sharp, but I'm curious whether you have done any formal evaluation of the quality of the results. For example, MRR and recall@k, even on a toy dataset? It seems like the quality of the generated responses will depend heavily on which docs are retrieved.
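For anyone unfamiliar with the two metrics mentioned here, a minimal sketch of how they are usually computed on a toy dataset. The names `runs`, `qrels`, and `evaluate` are illustrative assumptions, not anything from the project being discussed.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int) -> Dict[str, float]:
    """Average both metrics over all queries that have relevance labels.

    runs:  query id -> ranked list of retrieved doc ids (hypothetical format)
    qrels: query id -> set of doc ids judged relevant (hypothetical format)
    """
    queries = [q for q in qrels if q in runs]
    if not queries:
        return {"mrr": 0.0, "recall_at_k": 0.0}
    mrr = sum(reciprocal_rank(runs[q], qrels[q]) for q in queries) / len(queries)
    recall = sum(recall_at_k(runs[q], qrels[q], k) for q in queries) / len(queries)
    return {"mrr": mrr, "recall_at_k": recall}


if __name__ == "__main__":
    toy_runs = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
    toy_qrels = {"q1": {"d1"}, "q2": {"d9"}}
    print(evaluate(toy_runs, toy_qrels, k=2))
```

MRR rewards putting the first relevant doc near the top, while recall@k only asks whether the relevant docs made it into the top k at all, so the two can disagree.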

We have a dataset that we use internally to evaluate our search quality. It's more representative of our use case since it contains Slack messages, call transcripts, very technical design docs, and company policies, which are pretty different from what embedding models are typically trained on.

We checked recall within a 4K-token budget (a pretty typical token limit for the previous generation of LLMs) and were at over 94% recall on our 10K-document set. We also added a lot of noise (Slack messages from public Slack workspaces) to grow the set to hundreds of thousands of documents, and recall remained over 90%.
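The comment above does not spell out how recall at a token limit is defined, so the following is only one plausible reading: walk down the ranked results, keep documents until the token budget is used up, and count a query as a hit if a relevant document made it in. The function names and the `(doc_id, token_count)` input format are assumptions for illustration; token counts would come from whatever tokenizer the target LLM uses.

```python
from typing import List, Set, Tuple

RankedDocs = List[Tuple[str, int]]  # (doc_id, token_count), best match first


def docs_within_budget(ranked: RankedDocs, budget: int) -> List[str]:
    """Return the ids of the top-ranked docs that fit in the token budget."""
    kept: List[str] = []
    used = 0
    for doc_id, n_tokens in ranked:
        if used + n_tokens > budget:
            break  # stop at the first doc that would overflow the context
        kept.append(doc_id)
        used += n_tokens
    return kept


def hit_within_budget(ranked: RankedDocs, relevant: Set[str], budget: int) -> bool:
    """True if at least one relevant doc fits inside the budget."""
    return any(doc_id in relevant for doc_id in docs_within_budget(ranked, budget))


def recall_within_budget(queries: List[Tuple[RankedDocs, Set[str]]], budget: int = 4096) -> float:
    """Share of queries whose relevant doc lands inside the token budget."""
    if not queries:
        return 0.0
    hits = sum(hit_within_budget(ranked, relevant, budget) for ranked, relevant in queries)
    return hits / len(queries)


if __name__ == "__main__":
    ranked = [("a", 3000), ("b", 1000), ("c", 500)]
    print(recall_within_budget([(ranked, {"b"}), (ranked, {"c"})]))
```

Adding distractor documents, as described above, would only change the ranked lists fed in. The metric itself stays the same, which is what makes the before and after numbers comparable.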

I am also interested in how to do eval on an open source corporate search system. Privacy and information security make this challenging, right?
On privacy and security, we are the only option (as far as I know) where you can connect all your internal company docs and have everything processed locally and stored at rest within the deployment.

So basically you can have it completely air-gapped from the outside world. The only tough part is the local LLM, but there are lots of options for that these days.