


	by aryamanagraw 149 days ago
	We kept asking LLMs to rate things on 1-10 scales and getting inconsistent results. Turns out they're much better at arguing positions than assigning numbers, which makes sense given their training data. The courtroom structure (prosecution, defense, jury, judge) gave us adversarial checks we couldn't get from a single prompt. Curious if anyone has experimented with other domain-specific frameworks to scaffold LLM reasoning.

3 comments

deevelton 149 days ago

Experimented very briefly with a mediation (as opposed to a litigation) framework but it was pre-LLM and it was just a coding/learning experience: https://github.com/dvelton/hotseat-mediator

Cool write-up of your experiment, thanks for sharing. Would be interesting to see how results from one framework (mediation, whose goal is "resolution") differ from the other (litigation, whose goal is, basically, "truth/justice").


aryamanagraw 149 days ago

That's really cool! That's actually the standpoint we started with. We asked what a collaborative reconciliation of document updates looks like. However, the LLMs seemed to get "swayed" or show "bias" very easily. That is what pushed us toward adding an adversarial element. Even then, context engineering is your best friend.

You kind of have to fine-tune what the objectives are for each persona and how much context each one is entitled to. That helps ensure an objective court proceeding in which the arguments in both directions carry equal weight!

I love your point about incentivization. That seems to be a make-or-break element for a reasoning framework such as this.


storystarling 149 days ago

The reasoning gains make sense but I am wondering about the production economics. Running four distinct agent roles per update seems like a huge multiplier on latency and token spend. Does the claimed efficiency actually offset the aggregate cost of the adversarial steps? Hard to see how the margins work out if you are quadrupling inference for every document change.


aryamanagraw 149 days ago

The funnel is the answer to this. We're not running four agents on every PR: 65% are filtered before review even begins, and 95% of flagged PRs never reach the courtroom. This is because we do think there's some value in a single agent's judgment, and the prosecutor gets to choose whether or not to file charges.

Only ~1-2% of PRs trigger the full adversarial pipeline. The courtroom is the expensive last mile, deliberately reserved for ambiguous cases where the cost of being wrong far exceeds the cost of a few extra inference calls. Plus you can make token/model-based optimizations for the extra calls in the argumentation system.
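
To make the funnel arithmetic concrete, here is a rough sketch. It assumes the "flagged" PRs are the 35% that survive the first filter, that a single-agent review costs one call, and that the courtroom costs four (one per role). Those call counts and the function names are illustrative assumptions, not measured numbers.

```python
def courtroom_fraction(filtered_before_review: float, flagged_dropped: float) -> float:
    """Fraction of all PRs that reach the full adversarial pipeline."""
    reviewed = 1.0 - filtered_before_review      # PRs that get a single-agent review
    return reviewed * (1.0 - flagged_dropped)    # of those, the share that is escalated


def expected_calls_per_pr(
    filtered_before_review: float,
    flagged_dropped: float,
    review_calls: int = 1,      # assumed cost of the single-agent review
    courtroom_calls: int = 4,   # assumed cost: prosecution, defense, jury, judge
) -> float:
    """Average number of LLM calls per PR under the assumptions above."""
    reviewed = 1.0 - filtered_before_review
    escalated = courtroom_fraction(filtered_before_review, flagged_dropped)
    return reviewed * review_calls + escalated * courtroom_calls


if __name__ == "__main__":
    print(f"courtroom share: {courtroom_fraction(0.65, 0.95):.4f}")
    print(f"expected calls per PR: {expected_calls_per_pr(0.65, 0.95):.2f}")
```

With 65% and 95% plugged in, the courtroom share comes out to 0.35 × 0.05 = 1.75% of PRs, which lines up with the ~1-2% figure, and the average under these assumed call counts stays well below one call per PR.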


thatjoeoverthr 149 days ago

If you do want a numeric scale, ask for a binary (e.g. true / false) and read the log probs.
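
A minimal sketch of that idea, assuming your API can return log probabilities for the candidate first answer tokens as a token-to-logprob mapping. The mapping shape and the function name `binary_score` are assumptions for illustration, not any particular provider's response format.

```python
import math


def binary_score(logprobs: dict[str, float], yes: str = "true", no: str = "false") -> float:
    """Turn the log probs of two answer tokens into a score in [0, 1].

    Renormalizes over just the two answers, so probability mass the model
    put on other tokens is ignored.
    """
    if yes not in logprobs or no not in logprobs:
        raise KeyError("both answer tokens must be present in the log probs")
    # P(yes) / (P(yes) + P(no)), computed as a logistic of the log-odds.
    return 1.0 / (1.0 + math.exp(logprobs[no] - logprobs[yes]))


if __name__ == "__main__":
    # Hypothetical values: the model put 60% on "true" and 20% on "false".
    example = {"true": math.log(0.6), "false": math.log(0.2)}
    print(round(binary_score(example), 3))
```

In the example the score is 0.6 / (0.6 + 0.2) = 0.75, which you can rescale to whatever numeric range you need.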


kyeb 149 days ago

(disclaimer: I work at Falconer)

you would think so! but that's only optimal if the model already has all the information in recent context to make an optimally-informed decision.

in practice, this is a neat context engineering trick, where the different LLM calls in the "courtroom" have different context and can contribute independent bits of reasoning to the overall "case"
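
A rough sketch of what "different context per call" could look like. The role objectives, the context keys, and which role sees which key are all made up for illustration; this is not a description of Falconer's actual setup, and no LLM call is made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    objective: str
    allowed_context: tuple[str, ...]  # keys of the case this role may see


# Hypothetical role definitions; the split of context is an assumption.
ROLES = (
    Role("prosecution", "Argue that the document needs updating.", ("diff", "current_doc")),
    Role("defense", "Argue that the document is still accurate.", ("diff", "current_doc", "design_notes")),
    Role("jury", "Weigh both arguments independently.", ("arguments",)),
    Role("judge", "Issue a final, reasoned decision.", ("arguments", "verdicts")),
)


def build_prompt(role: Role, case: dict[str, str]) -> str:
    """Assemble a prompt containing only the context this role is entitled to."""
    visible = {key: case[key] for key in role.allowed_context if key in case}
    sections = "\n\n".join(f"[{key}]\n{text}" for key, text in visible.items())
    return f"Role: {role.name}\nObjective: {role.objective}\n\n{sections}"


if __name__ == "__main__":
    case = {"diff": "renamed flag --foo", "current_doc": "Use --foo.", "design_notes": "Alias kept."}
    for role in ROLES[:2]:
        print(build_prompt(role, case), end="\n\n---\n\n")
```

Because each prompt is built from a filtered view of the case, each call reasons from a different starting point rather than echoing one shared context.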


aryamanagraw 149 days ago

That's the thing with documentation; there are hardly any situations where a simple true/false works. Product decisions have many caveats and evolving behaviors coming from different people. At that point, a numerical grading format isn't something we even want — we want reasoning, not ratings.
