Hacker News
by vasilyt 106 days ago
Love this, and really appreciate you sharing concrete lessons from running at that scale.

Your point about consensus breaking between 10 and 300 tracks with what we’re seeing too. We chose Queen/Worker mostly for operational predictability, but we’re actively testing less centralized patterns (including debate-style synthesis similar to your oracle setup) to recover some of the diversity benefits without losing controllability.

The safety note is especially on point. “Unprogrammed coordination” is real, and we’re adding stronger circuit breakers and governance backstops specifically because social dynamics emerge faster than expected.

Also agree on benchmarking: collectives seem best on ambiguous, multi-perspective problems; single agents still dominate narrow, well-scoped execution.

If you’re open to it, I’d love to compare evaluation setups. 20K+ fragments is a serious dataset, and a shared benchmark pass could be genuinely useful for the whole space.