| OpenAI published an article and demo for scoring how well AI agents can work in a codebase (https://openai.com/index/harness-engineering/, https://www.youtube.com/watch?v=rhsSqr0jdFw). We turned it into a free tool anyone can use. Paste any public GitHub repo (or connect a private one) and get a live score across seven dimensions: bootstrap setup, task entry points, test harnesses, lint gates, agent docs, structured documentation, and decision records.
It clones the repo, runs static analysis, and scores each dimension 0-3 with evidence pulled from actual files. Takes about 60 seconds. Some repos we scored: PostHog: https://twill.ai/score/fd033516-628b-4c7c-8db6-d84e3f2737ba Supabase: https://twill.ai/score/b2825715-6c3d-4de1-a21b-fc5d9b17103b Codex: https://twill.ai/score/d7372d95-0501-4ad3-ae90-8f112ccafee0 The pattern we keep seeing: most repos lose points on agent-specific docs and decision records. Everything else tends to be decent. We built this scorecard as a free tool because agent performance is bounded by repo structure, not just model quality. Would love to hear what scores people get, and whether the rubric is missing anything. |
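To make the "score each dimension 0-3 with evidence from actual files" idea concrete, here is a rough sketch of what a static scorer of this shape could look like. It is not the tool's actual implementation: the marker file names in `MARKERS`, the `DimensionScore` type, the `score_repo` function, and the "count the markers, cap at 3" rule are all assumptions made up for illustration, and the sketch assumes the repo is already cloned to a local directory.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical marker paths per dimension. The real rubric's heuristics
# are not described in this post; these are illustrative guesses only.
MARKERS = {
    "bootstrap_setup": ["Makefile", "docker-compose.yml", ".devcontainer", "setup.sh"],
    "task_entry_points": ["Makefile", "package.json", "justfile", "Taskfile.yml"],
    "test_harnesses": ["tests", "test", "pytest.ini", "jest.config.js"],
    "lint_gates": [".pre-commit-config.yaml", ".eslintrc.json", "ruff.toml", ".golangci.yml"],
    "agent_docs": ["AGENTS.md", "CLAUDE.md", ".cursorrules"],
    "structured_documentation": ["docs", "README.md", "CONTRIBUTING.md"],
    "decision_records": ["docs/adr", "docs/decisions", "adr"],
}

MAX_SCORE = 3  # each dimension is scored 0-3, as in the post


@dataclass
class DimensionScore:
    name: str
    score: int
    evidence: list[str] = field(default_factory=list)  # paths that were found


def score_repo(root: Path) -> list[DimensionScore]:
    """Score an already-cloned repo by checking which marker paths exist."""
    results = []
    for name, markers in MARKERS.items():
        found = [m for m in markers if (root / m).exists()]
        # Assumed scoring rule: one point per marker found, capped at 3.
        results.append(DimensionScore(name, min(len(found), MAX_SCORE), found))
    return results


if __name__ == "__main__":
    for dim in score_repo(Path(".")):
        print(f"{dim.name}: {dim.score}/{MAX_SCORE} {dim.evidence}")
```

Counting marker files is only a stand-in for real static analysis, but it shows why agent docs and decision records are easy places to lose points: a repo with no `AGENTS.md`-style file or ADR directory has no evidence to score in those dimensions.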