Hacker News
by maggreenWAI 592 days ago
A) For Mind2Web: since there are multiple ways to reach a goal state, any thoughts on how to evaluate whether a task was successful? Should we let the same LLM, or another LLM, evaluate it?