I recently wrote a post outlining our method to reduce hallucinations in LLM agents by leveraging a verified semantic cache. The approach pre-populates the cache with verified question-answer pairs, ensuring that frequently asked questions are answered accurately and consistently without invoking the LLM unnecessarily.
The key idea lies in dynamically determining how queries are handled:
- Strong matches (≥80% similarity): Responses are directly served from the cache.
- Partial matches (60% to just under 80% similarity): Verified answers are used as few-shot examples to guide the LLM.
- No matches (<60% similarity): The query is processed by the LLM as usual.
This not only reduces hallucinations but also cuts costs and improves response times, since strong matches skip the LLM call entirely.
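For anyone who wants the routing spelled out, here is a minimal sketch of the three tiers in Python. It is not the notebook's code: `CacheHit`, `route`, `answer`, and the `llm` callable are hypothetical names, the prompt layout is made up for illustration, and it assumes the cache lookup has already returned the best match with a similarity score in [0, 1].

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Thresholds taken from the post; tune them for your own data.
STRONG_MATCH = 0.80
PARTIAL_MATCH = 0.60


@dataclass
class CacheHit:
    """Best verified Q&A pair found in the cache (hypothetical shape)."""
    question: str
    answer: str
    similarity: float  # assumed to be normalized to [0, 1]


def route(hit: Optional[CacheHit]) -> str:
    """Pick a handling tier from the best cache match, if any."""
    if hit is not None and hit.similarity >= STRONG_MATCH:
        return "cache"
    if hit is not None and hit.similarity >= PARTIAL_MATCH:
        return "few_shot"
    return "llm"


def answer(query: str, hit: Optional[CacheHit], llm: Callable[[str], str]) -> str:
    """Answer a query using the cache, a few-shot prompt, or the plain LLM."""
    tier = route(hit)
    if tier == "cache":
        # Strong match: serve the verified answer, no LLM call.
        return hit.answer
    if tier == "few_shot":
        # Partial match: show the verified pair to the LLM as an example.
        prompt = (
            f"Example Q: {hit.question}\n"
            f"Example A: {hit.answer}\n\n"
            f"Q: {query}\nA:"
        )
        return llm(prompt)
    # No usable match: send the query to the LLM unchanged.
    return llm(query)
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular model client, so the routing can be tested with a stub before wiring in a real model.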
Here's a Jupyter notebook walkthrough if anyone's interested in diving deeper: https://github.com/aws-samples/Reducing-Hallucinations-in-LL...
Would love to hear your thoughts. Is anyone else working on similar techniques or approaches? Thanks.