by jakozaur
109 days ago
As someone who builds eval and training datasets for a living, I can say that niche skills hold plenty of surprises.

The code is open source; you can run it yourself using the Harbor Framework:

  git clone git@github.com:QuesmaOrg/BinaryAudit.git
  export OPENROUTER_API_KEY=...
  harbor run \
    --path tasks \
    --task-name lighttpd-* \
    --agent terminus-2 \
    --model openrouter/anthropic/claude-opus-4.6 \
    --model openrouter/google/gemini-3-pro-preview \
    --model openrouter/openai/gpt-5.2 \
    --n-attempts 3

Please open a PR if you find something interesting, though our domain experts spend a fair amount of time looking at trajectories.