by callesgg
1178 days ago
Now, use this library to "bootstrap the smarts of LLaMA from its own smartness" like this: 1. Ask it things and let it answer. 2. Ask it to find errors in the answer it output and to correct them. 3. Use the original prompt and the corrected output as training data. With each iteration, this should make the model less and less likely to output statements that are self-contradictory or obviously wrong, until the model can no longer spot its own faults.
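A rough sketch of the three steps above, just to make the data flow concrete. Everything here is an assumption for illustration: `generate` is a hypothetical stand-in for whatever call runs the model (it is not the library's API), the critique prompt wording is made up, and `toy_generate` is a fake model used only so the snippet runs on its own.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: takes a prompt string, returns the model's text.
Generate = Callable[[str], str]


def build_self_correction_data(
    generate: Generate, prompts: List[str]
) -> List[Tuple[str, str]]:
    """Return (original prompt, corrected answer) pairs for later fine-tuning."""
    data = []
    for prompt in prompts:
        # Step 1: ask, and let the model answer.
        answer = generate(prompt)
        # Step 2: ask the model to find errors in its own answer and fix them.
        critique_prompt = (
            f"Question: {prompt}\n"
            f"Answer: {answer}\n"
            "Find any errors in the answer and write a corrected answer."
        )
        corrected = generate(critique_prompt)
        # Step 3: keep the ORIGINAL prompt paired with the corrected output.
        data.append((prompt, corrected))
    return data


def toy_generate(text: str) -> str:
    # Fake model: answers wrongly at first, correctly when asked to review.
    return "4" if text.startswith("Question:") else "5"


pairs = build_self_correction_data(toy_generate, ["What is 2 + 2?"])
print(pairs)
```

The actual training step on `pairs` is left out on purpose, since it depends entirely on the fine-tuning setup.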
|
But if you let an AI's approval be the metric, things get a lot more fuzzy and subjective. The goal is no longer "to write a good answer without error" but "to write an answer that is approved by the AI". Those are very different goals, and as you keep iterating the divergence gets bigger and bigger, until eventually the AI is just outputting complete nonsense that precisely hits certain sweet spots in the grading AI.
This divergence between the target being optimized and the actual human goal is a pretty interesting problem in AI safety research. I love the example where an AI trained to stay alive as long as possible in Tetris realized that pausing the game was the best strategy.
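Here is a toy illustration of that divergence, under heavily simplified assumptions. The "grading AI" is a made-up scorer that counts buzzwords, the "human goal" is a made-up check that the answer still contains the correct fact, and the optimizer is a plain greedy search. None of this models a real LLM; it only shows how optimizing the grader's score can destroy the thing you actually cared about.

```python
BUZZWORDS = ("clearly", "therefore", "precisely")


def grader_score(answer: str) -> int:
    # Hypothetical flawed grader: rewards words it associates with good answers.
    return sum(1 for word in answer.split() if word in BUZZWORDS)


def true_quality(answer: str) -> int:
    # Stand-in for the human goal: is the correct fact still in the answer?
    return 1 if "2 + 2 = 4" in answer else 0


def optimize_for_grader(answer: str, steps: int) -> str:
    """Greedily replace one word per step with whatever the grader likes best."""
    words = answer.split()
    for _ in range(steps):
        candidates = [
            " ".join(words[:i] + ["clearly"] + words[i + 1:])
            for i in range(len(words))
        ]
        best = max(candidates, key=grader_score)
        if grader_score(best) <= grader_score(" ".join(words)):
            break  # the grader cannot be pleased any further
        words = best.split()
    return " ".join(words)


start = "2 + 2 = 4"
final = optimize_for_grader(start, steps=10)
print(start, "| grader:", grader_score(start), "| true:", true_quality(start))
print(final, "| grader:", grader_score(final), "| true:", true_quality(final))
```

The grader's score only ever goes up, while the answer ends up as pure filler, which is the same shape of failure as the paused Tetris game.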