Hacker News new | ask | show | jobs
by aeternum 482 days ago
This seems useful for issues early in the convo but what if the AI responses diverge from the recorded convo prior to the issue being hit?
1 comments

That’s a great point! While we do our best to simulate an identical case, if the agent responds differently, our focus is on whether the key evaluator or goal for that replay set passes or fails. We use that as the source of truth and flag the exact moment where the conversation diverges from the expected flow.