| HN Mirror


by callesgg 1225 days ago

Now, use this library to "bootstrap the smarts of LLaMA from its own smartness" like this:

1. Ask it things. Let it answer.

2. Ask it to find errors in the answer it outputted and for it to correct the answer.

3. Use the original prompt and the corrected output as training data.

With each iteration, this should make the model less and less likely to output statements that are self-contradictory or obviously wrong, until the model can no longer spot its own faults.
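A minimal sketch of the data-collection half of the three steps above. Everything named here is an assumption for illustration: `generate` stands in for whatever function calls the model, and the wording of `CORRECTION_TEMPLATE` is made up. The fine-tuning step itself is not shown.

```python
from typing import Callable, List, Tuple

# Hypothetical correction prompt; the real wording would need experimentation.
CORRECTION_TEMPLATE = (
    "Question: {prompt}\n"
    "Answer: {answer}\n"
    "Find any errors or contradictions in the answer and write a corrected answer."
)


def build_self_correction_data(
    prompts: List[str],
    generate: Callable[[str], str],
    correction_template: str = CORRECTION_TEMPLATE,
) -> List[Tuple[str, str]]:
    """Return (original prompt, corrected answer) pairs to use as training data."""
    pairs = []
    for prompt in prompts:
        # Step 1: ask the model and let it answer.
        draft = generate(prompt)
        # Step 2: ask the same model to find and fix errors in its own answer.
        corrected = generate(
            correction_template.format(prompt=prompt, answer=draft)
        )
        # Step 3: pair the original prompt with the corrected output.
        pairs.append((prompt, corrected))
    return pairs
```

Each round would fine-tune on the returned pairs and then repeat with the updated model.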

5 comments

Drakim 1225 days ago

I recall reading that when training AlphaZero they would start by pitting it against itself, playing millions of games in a few days, which worked great because there is an external metric (who wins the chess game) that is an objectively good measure to train towards.

But if you let an AI's approval be the metric, things turn a lot more fuzzy and subjective. The goal is no longer "to write a good answer without error" but "to write an answer that is approved by the AI". Those are very different goals, and as you keep using it you'll get a bigger and bigger divergence, until eventually the AI is just answering complete garbage nonsense that precisely hits certain sweet spots in the grading AI.

This divergence of the target vs the actual human goal is a pretty interesting problem in AI safety research. I love the example where an AI trained to stay alive as long as possible in Tetris realized that pausing the game was the best strategy.
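A toy sketch of the divergence described above, where "approved by the grader" drifts away from "correct". The grader here is a deliberately flawed, made-up stand-in (it just counts confident-sounding keywords), and the optimiser is a simple greedy loop, not a real training procedure.

```python
# Hypothetical keywords that the flawed grader happens to reward.
KEYWORDS = ("therefore", "clearly", "correct")


def proxy_grader(answer: str) -> int:
    """Score an answer by counting rewarded keywords, ignoring its truth."""
    text = answer.lower()
    return sum(text.count(word) for word in KEYWORDS)


def optimise_against_grader(answer: str, steps: int) -> str:
    """Greedily edit the answer to raise the grader's score."""
    for _ in range(steps):
        # Candidate edits: keep the answer as is, or append one keyword.
        candidates = [answer] + [f"{answer} {word}" for word in KEYWORDS]
        answer = max(candidates, key=proxy_grader)
    return answer


before = "2 + 2 = 5"
after = optimise_against_grader(before, steps=3)
# The score rises while the underlying claim stays exactly as wrong as before.
print(proxy_grader(before), proxy_grader(after), after)
```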


aqme28 1225 days ago

You’re describing a GAN basically.

But yeah, you’re going to need an objective metric or human input otherwise the system is going to diverge in strange ways.


newswasboring 1225 days ago

I honestly think I might do this experiment, just to see what comes out. I know it will be utter garbage, but it will probably be interesting utter garbage.


callesgg 1225 days ago

Please do :)

The correction prompt is very important: it will largely determine the outcome of the process, and a bad correction prompt will obviously lead to a garbage result.

Training in steps with different prompts might be of value. The first step might be to fix contradictions, then factual errors if those are an issue. I got this idea when viewing the output of LLaMA, which often contains contradictions (e.g. one example I have seen is "Peter is a boy and he is part of the Gama sorority"). Asking it to fix those types of issues should be a good first step.

But I suspect that this type of training would need to be mixed with original training data. Otherwise the restructuring in the model caused by the new training would most likely garble the rest of the model.
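A rough sketch of what mixing might look like, assuming both datasets are plain Python lists of examples. The function name, the `original_fraction` parameter, and its default of 0.5 are all illustrative guesses, not a tested recipe.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_training_data(
    new_examples: Sequence[T],
    original_examples: Sequence[T],
    original_fraction: float = 0.5,
    seed: int = 0,
) -> List[T]:
    """Blend self-corrected examples with samples of the original data.

    original_fraction is the share of the final mix drawn from the
    original training data (an assumed knob, to be tuned empirically).
    """
    if not 0 <= original_fraction < 1:
        raise ValueError("original_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_orig / (n_new + n_orig) = original_fraction for n_orig.
    n_original = round(
        len(new_examples) * original_fraction / (1 - original_fraction)
    )
    # Never ask for more original examples than are available.
    n_original = min(n_original, len(original_examples))
    mixed = list(new_examples) + rng.sample(list(original_examples), n_original)
    rng.shuffle(mixed)
    return mixed
```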


Dwedit 1225 days ago

That wasn't an AI, that was a "make the numbers go up" (lexicographic ordering) system with TAS rewinding for short-term brute-forcing.
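For anyone unfamiliar with the term, here is a tiny sketch of "make the numbers go up" under a lexicographic ordering. The state layout `(level, score)` and the `step` function are invented for illustration and are not taken from the actual system; Python tuples already compare lexicographically, so `max` does the work.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[int, ...]


def pick_move(
    state: State,
    moves: Sequence[str],
    step: Callable[[State, str], State],
) -> str:
    """Pick the move whose resulting state is lexicographically largest."""
    # Tuples compare element by element, so earlier entries dominate later ones.
    return max(moves, key=lambda move: step(state, move))


def toy_step(state: State, move: str) -> State:
    """Hypothetical game: 'advance' raises the level, 'grind' raises the score."""
    level, score = state
    if move == "advance":
        return (level + 1, 0)
    return (level, score + 40)


# (2, 0) beats (1, 90) because the first element is compared first.
print(pick_move((1, 50), ["grind", "advance"], toy_step))
```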


MattPalmer1086 1225 days ago

Interesting, but the core point remains true. The algorithm optimises for something that may not entirely coincide with the creator's intentions.


jkeisling 1225 days ago

For those skeptical of the above comment, this technique absolutely works and powers production-grade models like Anthropic’s Claude. There’s plenty of literature on this, but here are a couple of papers that might be helpful for people doing their own training:

- Constitutional AI: by Anthropic, an “RLAIF” technique that creates the preference model for “finding errors” based on a set of around 70 “principles” the AI uses to check its own output, not human feedback like in ChatGPT. This technique taught the Claude bot to avoid harmful output with few to no manual harmfulness labels! https://arxiv.org/abs/2212.08073. Not sure if there’s a HuggingFace implementation with LoRA / PEFT yet like there is for regular RLHF, so somebody may still need to implement this for LLaMA.

- Self-Instruct: Creates artificial training data for instruction tuning from an untuned base model, starting from a tiny seed of prompts, and filters out the bad ones before fine-tuning. Manages to approach InstructGPT performance with only ~100 human labels. https://arxiv.org/abs/2212.10560


jointpdf 1225 days ago

Or it will twist itself into a giant hairball of contorted logic, like GPT-3.5 does when I (a human) encourage it to explain its errors.


8jy89hui 1225 days ago

You should try using a larger model like llama-35b or even GPT-3 for the feedback. That way you might be able to condense knowledge from these really big models into a smaller model.


tysam_and 1225 days ago

This is a cool idea in theory and I think could be useful in certain kinds of circumstances, but this particular instantiation would likely go into a bad bias spiral.

This is somewhat similar to how GANs try to learn the density of the underlying data, but here you do not have the underlying data as a reference, if that makes sense. It's sort of like filling a mattress with helium instead of air. Sure, the mattress will be lighter, but that does not mean you will float on it, if that makes any sense at all.

Hope that helps as a cogent answer to this question.
