Hacker News
by houmercodes 133 days ago
Genuine question about the eval methodology: how do you handle website non-determinism?

A lot of these sites serve different layouts, A/B tests, cookie consent modals, etc. across sessions. Did you control for that across agents, or is each agent hitting the live site independently at different times?

If it's the latter, some of the variance between agents could just be "Operator happened to get the GDPR popup and didn't know how to dismiss it." It would be useful to know whether all agents were evaluated on the same snapshots or within the same time window.