by janalsncm
479 days ago
The demo looked sharp, but I am curious whether you have done any formal evaluation of the quality of the results. For example, MRR and recall@k, even on a toy dataset? It seems like the quality of the generated responses will depend heavily on which docs are retrieved.
We checked the recall at 4K tokens (a pretty typical token limit for the previous generation of LLMs) and were at over 94% recall for our 10K-document set. We also added a lot of noise (Slack messages from public Slack workspaces) to get to hundreds of thousands of documents, and recall still remained above 90%.
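
For anyone who wants to run this kind of check on a toy dataset, here is a minimal sketch of recall@k and MRR in Python. It is not the evaluation code used for the numbers above: the function names, the doc IDs, and the cutoff by document count `k` (rather than by a token budget like 4K) are all assumptions made for illustration.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> Tuple[float, float]:
    """Average recall@k and MRR over (ranked_results, relevant_set) pairs."""
    if not runs:
        return 0.0, 0.0
    mean_recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
    return mean_recall, mrr


# Hypothetical toy data: each query has a ranked result list and a labeled relevant set.
toy_runs = [
    (["a", "b", "c"], {"b"}),       # relevant doc at rank 2
    (["x", "y", "z"], {"z", "q"}),  # one relevant doc at rank 3, one never retrieved
]
mean_recall, mrr = evaluate(toy_runs, k=2)
```

To approximate a token-limit setup like the one described in the reply, one option would be to cut the ranked list where the cumulative token count of the retrieved docs exceeds the budget, instead of at a fixed `k`.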