| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by chrtng 605 days ago
	Thank you for your question! While we haven't published a formal evaluation yet, it's something we are working toward. Currently, we rely mostly on human reviews to monitor and assess LLM outputs. We also maintain a golden test suite that is run against every release to ensure consistency and quality over time, using regex-based evaluations. Our key metrics include the time and cost per agentic loop, as well as the false positive rate for a full end-to-end test. If you have any specific benchmarks or evaluation metrics you'd suggest, we'd be happy to hear them!

1 comments

aksophist 604 days ago

What is a false positive rate? Is it when the agent falsely passes or falsely “finds a bug”? And regardless of which: why don’t you include the other as a key metric?

I’m not aware of any evals or shared metrics. But measuring a testing agents performance seems pretty important.

What is your tool’s FPR on your golden suite?

link