by danoandco 86 days ago
OpenAI published an article and demo for scoring how well AI agents can work in a codebase (https://openai.com/index/harness-engineering/, https://www.youtube.com/watch?v=rhsSqr0jdFw). We turned it into a free tool anyone can use.

Paste any public GitHub repo (or connect a private one) and get a live score across seven dimensions: bootstrap setup, task entry points, test harnesses, lint gates, agent docs, structured documentation, and decision records. It clones the repo, runs static analysis, and scores each dimension 0-3 with evidence pulled from actual files. Takes about 60 seconds.
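To make the rubric concrete, here is a rough sketch of what a static, file-based scorer along these lines might look like. It is not the tool's actual implementation: the marker patterns in `RUBRIC`, the `score_repo` function, and the "one point per matched pattern, capped at 3" rule are all assumptions made up for illustration. Only the seven dimension names and the 0-3 scale come from the description above.

```python
from fnmatch import fnmatchcase

# Hypothetical marker patterns per dimension (assumption: the real checks
# are not described in this post and are likely more involved).
RUBRIC = {
    "bootstrap_setup": ["Makefile", "docker-compose.yml", ".devcontainer/*", "scripts/setup*"],
    "task_entry_points": ["Makefile", "package.json", "justfile", "Taskfile.yml"],
    "test_harnesses": ["tests/*", "test/*", "pytest.ini", "jest.config.*"],
    "lint_gates": [".pre-commit-config.yaml", ".eslintrc*", "ruff.toml", ".github/workflows/*"],
    "agent_docs": ["AGENTS.md", "CLAUDE.md", ".cursorrules"],
    "structured_docs": ["README.md", "docs/*", "CONTRIBUTING.md"],
    "decision_records": ["docs/adr/*", "docs/decisions/*", "adr/*"],
}


def score_repo(paths):
    """Score a repo from its relative file paths.

    Returns {dimension: {"score": 0-3, "evidence": [matched paths]}}.
    Assumed rule: one point per distinct pattern matched, capped at 3.
    """
    report = {}
    for dimension, patterns in RUBRIC.items():
        evidence = []
        for pattern in patterns:
            # Keep the first file matching this pattern as evidence.
            hit = next((p for p in paths if fnmatchcase(p, pattern)), None)
            if hit is not None:
                evidence.append(hit)
        report[dimension] = {"score": min(3, len(evidence)), "evidence": evidence}
    return report


if __name__ == "__main__":
    sample = ["README.md", "Makefile", "tests/test_api.py", "AGENTS.md"]
    for dimension, result in score_repo(sample).items():
        print(dimension, result["score"], result["evidence"])
```

The point of keeping the evidence next to each score is that every number can be traced back to a file in the repo, which is the property the scorecard advertises.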

Some repos we scored:

PostHog: https://twill.ai/score/fd033516-628b-4c7c-8db6-d84e3f2737ba

Supabase: https://twill.ai/score/b2825715-6c3d-4de1-a21b-fc5d9b17103b

Codex: https://twill.ai/score/d7372d95-0501-4ad3-ae90-8f112ccafee0

The pattern we keep seeing: most repos lose points on agent-specific docs and decision records. Everything else tends to be decent.

We built this scorecard as a free tool because agent performance is bounded by repo structure, not just model quality.

Would love to hear what scores people get, and whether the rubric is missing anything.