|
|
|
|
|
by krawfy
1058 days ago
|
|
Great question, chainforge looks interesting! We offer auto-evals as one tool in the toolbox. We also consider structured output validations, semantic similarity to an expected result, and manual feedback gathering. If anything, I've seen that people are more skeptical of LLM auto-eval because of the inherent circularity, rather than over-trusting it. Do you have any suggestions for other evaluation methods we should add? We just got started in July and we're eager to incorporate feedback and keep building. |
|
For suggestions, one thing I'm curious about is how we can have out-of-the-box benchmark datasets and do this responsibly. ChainForge supports most OpenAI evals, but from adding this we realized the quality of OpenAI Evals is really _sketchy_... duplicate data, questionable metrics, etc. OpenAI has shown that trusting the community to make benchmarks is perhaps not a good idea; we should instead make it easier for scientists/engineers to upload their benchmarks and make it easier for others to run them. That's one thought, anyway.