by sgk284
491 days ago
> A more robust approach would be to give the whole reasoning to an LLM and ask to grade according to a given criterion

We actually use a variant of this approach in our reasoning prompts. We use structured output to force the LLM to think for 15 steps, and in each step we force it to generate a self-assessed score and then decide whether to CONTINUE, ADJUST, or BACKTRACK.

- Evaluate quality with reward scores (0.0-1.0)
- Guide next steps based on rewards:
• 0.8+ → CONTINUE current approach
• 0.5-0.7 → ADJUST with minor changes
• Below 0.5 → BACKTRACK and try different approach
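
To make the thresholds concrete, here is a minimal sketch of the score-to-decision mapping in Python. This is not our actual code: `Decision` and `decide` are hypothetical names for illustration. The list above leaves scores between 0.7 and 0.8 unspecified, so this sketch assumes anything from 0.5 up to (but not including) 0.8 means ADJUST.

```python
from enum import Enum


class Decision(Enum):
    """The three moves the model can pick after scoring a step."""
    CONTINUE = "CONTINUE"
    ADJUST = "ADJUST"
    BACKTRACK = "BACKTRACK"


def decide(reward: float) -> Decision:
    """Map a self-assessed reward in [0.0, 1.0] to the next move.

    Assumption: scores in the 0.7-0.8 gap are treated as ADJUST,
    since the original thresholds do not cover that range.
    """
    if not 0.0 <= reward <= 1.0:
        raise ValueError(f"reward must be in [0.0, 1.0], got {reward}")
    if reward >= 0.8:
        return Decision.CONTINUE
    if reward >= 0.5:
        return Decision.ADJUST
    return Decision.BACKTRACK
```

In the setup described above the model itself emits both the score and the decision in its structured output, so a function like this would mostly be useful for checking that the two agree.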
I go into a bit more depth about it here, with an explicit example of its thinking at the end: https://bits.logic.inc/p/the-eagles-will-win-super-bowl-lix