by chrtng
605 days ago
Thank you for your question! We haven't published a formal evaluation yet, but we are working toward one. For now, we rely mostly on human review to monitor and assess LLM outputs. We also maintain a golden test suite, scored with regex-based evaluations, that runs against every release to check consistency and quality over time. Our key metrics are the time and cost per agentic loop and the false positive rate for a full end-to-end test. If you have specific benchmarks or evaluation metrics to suggest, we'd be happy to hear them!
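To make the regex-based golden suite and the false positive rate concrete, here is a minimal Python sketch. It is not the actual implementation: `GoldenCase`, `run_golden_suite`, and `false_positive_rate` are hypothetical names, and it assumes a "false positive" means the agent flagged a failure on a run where the application was actually fine.

```python
import re
from dataclasses import dataclass


@dataclass
class GoldenCase:
    """One golden test case (hypothetical structure, for illustration only)."""
    name: str
    output: str   # LLM output captured for this case
    pattern: str  # regex the output is expected to match


def run_golden_suite(cases):
    """Return {case name: True if the output matches its regex}."""
    return {c.name: re.search(c.pattern, c.output) is not None for c in cases}


def false_positive_rate(flagged, actually_broken):
    """FPR = FP / (FP + TN), computed over runs where nothing was broken.

    Assumption: flagged[i] is True when the agent reported a failure, and
    actually_broken[i] is True when a human confirmed a real defect.
    """
    pairs = list(zip(flagged, actually_broken))
    fp = sum(1 for f, b in pairs if f and not b)
    tn = sum(1 for f, b in pairs if not f and not b)
    return fp / (fp + tn) if (fp + tn) else 0.0


cases = [
    GoldenCase("login", "Clicked 'Sign in' and saw the dashboard", r"dashboard"),
    GoldenCase("checkout", "Order could not be placed", r"order confirmed"),
]
results = run_golden_suite(cases)

# Four end-to-end runs: the agent flagged runs 0 and 3, only run 3 was a real bug.
fpr = false_positive_rate([True, False, False, True], [False, False, False, True])
```

In this sketch the FPR is computed only over runs where the application was actually fine, so confirmed bugs do not dilute the number.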
|
I’m not aware of any evals or shared metrics, but measuring a testing agent’s performance seems pretty important.
What is your tool’s FPR on your golden suite?