Hacker News
by syntex 54 days ago
These benchmarks mean very little. The real test is model + harness: an agentic system that can fulfill given goals.