by llm_trw
487 days ago
> This brings us to our current limitations. Right now, DeepEval’s primary evaluation method is LLM-as-a-judge. We use techniques such as GEval and question-answer generation to improve reliability, but these methods can still be inconsistent. Even with high-quality datasets curated by domain experts, our evaluation metrics remain the biggest blocker to our goal.

Have you done any work on dynamic data generation? I've found that even taking a public benchmark and remixing the order of its questions had a large impact on model performance, ranging from catastrophic for tiny models to problematic for larger models once you get past their effective internal working memory.
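For anyone who wants to try the order remixing described above, here is a minimal sketch, assuming the benchmark is just a list of question strings. The names `remix_question_order` and `remixed_variants` are made up for illustration and are not part of DeepEval or any benchmark harness; scoring each variant with a model is left out.

```python
import random


def remix_question_order(questions, seed):
    """Return a copy of `questions` in a seed-determined order.

    Hypothetical helper: the input list is left untouched, and the
    same seed always yields the same order for a given input.
    """
    rng = random.Random(seed)  # private RNG, does not touch global state
    shuffled = list(questions)
    rng.shuffle(shuffled)
    return shuffled


def remixed_variants(questions, seeds):
    """Build one reordered copy of the benchmark per seed."""
    return {seed: remix_question_order(questions, seed) for seed in seeds}


# Placeholder questions standing in for a real public benchmark.
benchmark = [f"Q{i}" for i in range(1, 9)]
variants = remixed_variants(benchmark, seeds=[0, 1, 2])

for seed, order in variants.items():
    print(seed, order)
```

Running the same model over each variant and comparing scores would show how sensitive it is to ordering; fixed seeds keep the comparison reproducible.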
We do synthetic data generation for custom application use cases such as RAG, summarization, text-to-SQL, etc. We call this module the "synthesizer", and you can customize your data generation pipeline however you want (I think, let me know otherwise!).
Docs for the synthesizer are here: https://docs.confident-ai.com/docs/synthesizer-introduction. There's a nice "how does it work" section at the bottom that explains it in more detail.