by mschwarz
69 days ago
Day to day I run a dev pod (implementor + QA + frontend design), a review pod doing adversarial review with one Claude and one Codex, and an orchestrator pair. The best illustration of real work getting done: the longest single rig I've kept running continuously so far was about 4 days. That was a large implementation spec executed with the test-driven dev approach from obra superpowers, plus independent deep contextual code reviews at milestones (my own skill pack), plus automated Vercel agent-browser testing along the way. So currently it's a closed SDLC loop that is limited only by the amount of work I give it.

The "babysitting agents" part moves me up a layer, to watching for spec drift and handling weird edge cases that come up. So it's not set and forget, but you can definitely have it work on something real overnight to get that 'my agents shipped code while I slept' kind of outcome. I watch a demo video in the morning to see what they built, then do my own code review spot checks of the PRs.

The original motivation for making OpenRig is that this pattern works well. I've been doing this for months now, and I'm sure many people have also gotten something like it to work, but the topology is fragile. The sessions die, your laptop needs a reboot, and you lose the setup that took weeks to perfect. OpenRig makes the topology itself a first-class thing, like a docker-compose but for the topology of Claude Codes / Codex on your machine and all the specific context and configs you fine-tuned.

Regarding supervision, that is the key question for sure. I can't really babysit more than 4-5 agents without feeling like I've lost the plot a bit. So the demo pod in the onboarding includes an example of a pattern I use where there are 2 orchestrators in a "high availability" pair, so I really only interact with 1 agent for the workstream: the orch-lead. The peer is there to monitor and absorb the lead's mental model in realtime, and can take over the rig if the lead hits its context limit or something else goes wrong.
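
A rough Python sketch of the lead/peer handoff described above. This is not OpenRig's actual implementation: the `Orchestrator` and `HAPair` classes, the token budget, the 90% threshold, and the idea that the peer's condensed view costs a quarter of the lead's context are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Orchestrator:
    name: str
    context_limit: int                          # hypothetical token budget
    context_used: int = 0
    notes: list = field(default_factory=list)   # running mental model of the workstream

    def near_limit(self, threshold: float = 0.9) -> bool:
        # True once the session has used most of its (assumed) context budget.
        return self.context_used >= threshold * self.context_limit


@dataclass
class HAPair:
    lead: Orchestrator
    peer: Orchestrator

    def handle(self, message: str, cost: int) -> str:
        """Send one message to the current lead; return the name of who handled it."""
        if self.lead.near_limit():
            self.failover()
        self.lead.notes.append(message)
        self.lead.context_used += cost
        # Assumption: the peer keeps only a condensed view, so it fills up more slowly.
        self.peer.notes.append(message)
        self.peer.context_used += cost // 4
        return self.lead.name

    def failover(self) -> None:
        """Promote the peer; restart the old lead as a fresh peer seeded with the notes."""
        old_lead = self.lead
        self.lead = self.peer
        # Simplification: the restarted session starts with an empty context counter.
        self.peer = Orchestrator(old_lead.name, old_lead.context_limit,
                                 notes=list(self.lead.notes))


pair = HAPair(Orchestrator("orch-lead", 100), Orchestrator("orch-peer", 100))
handled_by = [pair.handle(f"task {i}", cost=30) for i in range(5)]
print(handled_by)
```

In this toy run the first three messages go to `orch-lead`; once it crosses the threshold, `orch-peer` is promoted and handles the rest, while the old lead comes back as the new peer with the notes carried over. The human still talks to "whoever is lead", which is the point of the pattern.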
|
I tried doing the same for maintaining OSS projects. So far, the best I could manage is to get the agents to autonomously do ~80% of the work. But then I have to manually review each potential PR, and in almost every case work further with an agent, giving it live guidance to fix things. This takes about as much time as working without the swarm. So far I've found the swarm is mostly useful for the initial scouting: mapping out what work needs to be done in the first place and storing it in a nice JSON file.
From my observations, all it takes is one mistake by an agent. From there, the architecture snowballs into chaos as future work builds on top of the incorrect initial approach.