by xinweihe, 317 days ago
Yep, we're working on a golden test set with known root causes to benchmark and track agent performance over time. It's taking a bit of work to get right, but we're on it and definitely open to contributions!