by Drakim 1178 days ago
I recall reading that when training AlphaZero they would pit it against itself, playing millions of games in a few days, which worked great because there is an external metric (who wins the chess game) that is an objectively good measure to train towards.

But if you let an AI's approval be the metric, things turn a lot more fuzzy and subjective. The goal is no longer "to write a good answer without error" but "to write an answer that is approved by the AI". Those are very different goals, and the longer you keep training against that approval, the bigger the divergence gets, until eventually the AI is just answering complete garbage that precisely hits certain sweet spots in the grading AI.
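
A toy sketch of that drift, in Python. Everything here is made up for illustration: the buzzword-counting grader, the "share of useful words" quality measure, and the greedy optimiser stand in for a real grading model and a real training loop.

```python
# Hypothetical quirk of the grading AI: it likes these words.
BUZZWORDS = ("clearly", "therefore", "rigorous")


def grader_score(answer: str) -> int:
    """Proxy metric: a flawed grader that just counts buzzwords."""
    words = answer.split()
    return sum(words.count(b) for b in BUZZWORDS)


def true_quality(answer: str) -> float:
    """Stand-in for what we actually want: share of words that are not filler."""
    words = answer.split()
    useful = [w for w in words if w not in BUZZWORDS]
    return len(useful) / len(words) if words else 0.0


def optimise_for_grader(answer: str, steps: int):
    """Greedily edit the answer to please the grader; record both scores."""
    history = [(grader_score(answer), true_quality(answer))]
    for _ in range(steps):
        candidates = [answer + " " + b for b in BUZZWORDS]
        answer = max(candidates, key=grader_score)  # only the proxy is consulted
        history.append((grader_score(answer), true_quality(answer)))
    return history


if __name__ == "__main__":
    for proxy, quality in optimise_for_grader("2 + 2 = 4", steps=10):
        print(f"grader={proxy:2d}  quality={quality:.2f}")
```

The grader's score climbs on every step while the quality measure only falls, because nothing in the loop ever looks at quality.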

This divergence of the target vs the actual human goal is a pretty interesting problem in AI safety research. I love the example where an AI trained to stay alive as long as possible in Tetris realized that pausing the game was the best strategy.

3 comments

You’re describing a GAN basically.

But yeah, you’re going to need an objective metric or human input, otherwise the system is going to diverge in strange ways.

I honestly think I might do this experiment, just to see what comes out. I know it will be utter garbage, but it will probably be interesting utter garbage.
Please do :)

The correction prompt is very important: it will definitely determine the outcome of the process. A bad correction prompt will obviously lead to a garbage result.

Training in steps with different prompts might be of value. The first step might be to fix contradictions, then factual errors if those are an issue. This is an idea that I got when viewing the output of LLaMA, which often contains contradictions (e.g. an example I have seen is "Peter is a boy and he is part of the Gama sorority"). Asking it to fix those types of issues should be a good first step.

But I suspect that this type of training would need to be mixed with original training data. Otherwise the restructuring in the model caused by the new training would most likely garble the rest of the model.
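
A rough sketch of what that mixing could look like, assuming the training examples are just Python lists. The function name `mixed_batches` and the `replay_fraction` parameter are hypothetical, not from any particular training library.

```python
import random


def mixed_batches(original, corrections, batch_size=8, replay_fraction=0.5, seed=0):
    """Yield batches that blend new correction examples with original data.

    Each correction example is used exactly once; the rest of every batch is
    filled with examples sampled from the original training data.
    """
    rng = random.Random(seed)
    original = list(original)
    corrections = list(corrections)
    n_replay = round(batch_size * replay_fraction)
    n_new = batch_size - n_replay
    if n_new < 1:
        raise ValueError("replay_fraction leaves no room for new examples")

    rng.shuffle(corrections)
    for start in range(0, len(corrections), n_new):
        new_part = corrections[start:start + n_new]
        replay_part = rng.sample(original, min(n_replay, len(original)))
        batch = new_part + replay_part
        rng.shuffle(batch)  # avoid a fixed new/old order inside the batch
        yield batch


if __name__ == "__main__":
    original = [("orig", i) for i in range(100)]
    corrections = [("fix", i) for i in range(12)]
    for batch in mixed_batches(original, corrections):
        print(sum(kind == "fix" for kind, _ in batch), "new /", len(batch), "total")
```

What the right ratio is would have to be found by experiment; 0.5 is only a placeholder.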

That wasn't an AI, that was a "Make the numbers go up" (lexicographic ordering) system with TAS rewinding for short-term brute-forcing.
Interesting, but the core point remains true. The algorithm optimises for something which may not entirely coincide with the creator's intentions.
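
For anyone unfamiliar with the idea, here is a toy version in Python. The game, its actions, and the `(level, score)` state are invented for illustration and are not how the actual system was built; the point is only that tuples compare lexicographically, and that "rewinding" amounts to re-simulating every short input sequence from a saved state.

```python
from itertools import product

ACTIONS = ("wait", "coin", "door")  # hypothetical inputs for a toy game


def step(state, action):
    """Advance the toy game by one input. State is (level, score)."""
    level, score = state
    if action == "coin":
        return (level, score + 1)
    if action == "door" and score >= 3:
        return (level + 1, 0)  # entering a new level resets the score
    return state


def best_first_action(state, horizon):
    """Brute-force every input sequence of length `horizon` from the saved
    state and return the first input of the lexicographically best outcome."""
    def outcome(sequence):
        s = state  # "rewind": always restart from the saved state
        for action in sequence:
            s = step(s, action)
        return s  # tuples compare lexicographically: level first, then score

    best_sequence = max(product(ACTIONS, repeat=horizon), key=outcome)
    return best_sequence[0]


def play(steps, horizon, state=(0, 0)):
    """Repeatedly commit the first input of the best short-horizon sequence."""
    for _ in range(steps):
        state = step(state, best_first_action(state, horizon))
    return state


if __name__ == "__main__":
    print(play(steps=8, horizon=2))
```

Because `(1, 0) > (0, 99)`, the search gives up any amount of score to advance a level. That is the "numbers go up" behaviour, with no notion of what the designer of the game intended.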