|
|
|
|
|
by shubhamintech
103 days ago
|
|
The full-session evaluation framing is the right call - most teams don't realize the failure happened in turn 2 until they've spent 3 hours blaming the model. One thing worth thinking about as you grow: connecting caught regressions to production conversation data. When your simulation flags a new failure mode, being able to say "this pattern has already surfaced X times in prod this week" cuts the prioritization debate in half. Does Cekura currently let you correlate simulation failures back to real user sessions, or is that still a manual step? |
|
Doing it in production also helps to go run simulations by replaying those production conversations ensuring you are handling regression.