|
|
|
|
|
by fatso784
1056 days ago
|
|
I like the support for Vector DBs and LLaMa-2. I'm curious as to whether and what influences compelled PromptTools, and how it differs from other tools in this space. For context, we've also released a prompt engineering IDE, ChainForge, which is open-source and has many of the features here, such as querying multiple models at once, prompt templating, evaluating responses with Python/JS code and LLM scorers, plotting responses, etc (https://github.com/ianarawjo/ChainForge and a playground at http://chainforge.ai). One big problem we're seeing in this space is over-trust in LLM scorers as 'evaluators'. I've personally seen that minor tweaks to a scoring prompt can sometimes result in vastly different evaluation 'results.' Given recent debacles (https://news.ycombinator.com/item?id=36370685), I'm wondering how we can design LLMOps tools for evaluation which both support the use of LLMs as scorers, but also caution users about their results. Are you thinking similarly about this question, or seen usability testing which points to over-trust in 'auto-evaluators' as an emerging problem? |
|
We offer auto-evals as one tool in the toolbox. We also consider structured output validations, semantic similarity to an expected result, and manual feedback gathering. If anything, I've seen that people are more skeptical of LLM auto-eval because of the inherent circularity, rather than over-trusting it.
Do you have any suggestions for other evaluation methods we should add? We just got started in July and we're eager to incorporate feedback and keep building.