Hacker News new | ask | show | jobs
by phatfish 120 days ago
Does anyone know what this "APEX-Agents benchmark for long time horizon investment banking, consulting and legal work" actually evaluates?

That sounds so broad that creating a meaningful benchmark is probably as difficult as creating an AI that actually "solves" those domains.