| HN Mirror


by jawerty 1058 days ago

I run through a lot of these concepts, specifically RLHF, in my latest coding stream where I fine-tune Llama 2, if anyone's interested in an LLM deep dive: https://www.youtube.com/watch?v=TYgtG2Th6fI&t=4002s

Long story short, the size of the model and the reward mechanisms built on human annotation/feedback are the main differences between what we can do as independents in OSS vs. OpenAI. BigCode's StarCoder (https://huggingface.co/bigcode/starcoder) has some human labor backing it (I believe; correct me if I'm wrong), but at the end of the day a company will always be able to gather people better.

Not knocking StarCoder; in fact, I streamed how to fine-tune it the other day. However, it's important to mention some of the limitations in the OSS space right now (a big reason Meta pushing Llama 2 is nice to have).


bilsbie 1058 days ago

When you’re doing RLHF are you actually modifying the weights of llama itself?

Or is something on top?

RC_ITR 1058 days ago

RLHF does change the parameters.

The way to think about it is that backpropagation changes the parameters of a model so they get closer to some sort of desired output.

In pre-training and SFT, the parameters are changed so the model does a better job of replicating the next word in the training data, given the words it has already seen.

In RLHF, the parameters are changed so the model does a better job of outputting the response that aligns to the human's preference (see: the feedback screen in the linked article).
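To make the "replicating the next word" objective concrete, here is a minimal sketch of the next-token loss used in pre-training and SFT. The three-token vocabulary, the logits, and the function names are made up for illustration; a real model would produce the logits and a framework would handle the gradients.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_step, target_ids):
    """Average negative log-likelihood of the true next token at each position."""
    losses = []
    for logits, target in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy data: vocabulary of 3 tokens, two positions.
targets = [0, 1]
confident = next_token_loss([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]], targets)
uniform = next_token_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], targets)
print(confident < uniform)  # the model that favors the true tokens has lower loss
```

Backpropagation pushes the parameters in the direction that lowers this number, which is what "getting closer to the training data" means here.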

bilsbie 1058 days ago

Thanks. That helps.

So how can you update weights without doing back-propagation? Or is it still back propagation but with a different metric?

RC_ITR 1058 days ago

Both do backpropagation, the difference is what you are backpropagating towards.

Think of it this way - there are an equal number of rude and polite comments online (actually probably way more rude ones).

If a model is trained on that data, how do you get it to only respond politely?

You could filter out the rude comments, but that's expensive, and those rude comments may still contain other helpful patterns that teach your model other stuff.

Alternatively, you could pre-train on the rude comments, but then after pre-training is done, you hire a ton of people in a low cost geo and ask them 'do you prefer comment 1 (a polite output of the pre-trained model) or comment 2 (a rude output).'

The model then 'learns' that comment 1 is better because it gets more votes, and adjusts its parameters (through backpropagation) to produce comment 1 instead of comment 2.

In practice, you can't control what the model outputs, so you just ask it to give you its top N responses and the humans rank all of them, hoping you get a decent mix of rude and polite.
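One common way to turn "comment 1 got more votes than comment 2" into a trainable loss is a pairwise (Bradley-Terry style) objective on a reward model's scores. The sketch below assumes the reward model has already produced a scalar score for each comment; the scores and the function name are hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form.

    The loss is small when the preferred response scores well above the
    rejected one, and large when the ordering is wrong.
    """
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical scores: the polite comment is the one the annotators preferred.
agrees_with_humans = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
disagrees_with_humans = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)
print(agrees_with_humans < disagrees_with_humans)
```

A full ranking of N responses can be broken into pairs and fed through the same loss.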

bradfox2 1058 days ago

It's still a loss being backpropagated, but the loss is calculated against a different criterion.

bilsbie 1058 days ago

Ok that makes a lot of sense.

Why do they call it reinforcement learning then? Is it not traditional RL such as Q-learning?

dgant 1058 days ago

The distinction making it RL is that the model is training on data produced by the model itself.

The benefit of RL in general is that you're training on states the agent is likely to find itself in, and the cost is needing an agent which explores salient states. That's why we keep seeing RL as a finishing step after imitation (e.g. AlphaStar first learning StarCraft from replays).

bradfox2 1056 days ago

LLM output is scored by another model that produces a reward for the entire sequence emitted by the LLM. The reward model is trained on human preferences or some other metric usually. It's RL because we train on the reward and not some language modeling objective.

The LLM is trained to increase this reward score (or, equivalently, to minimize its negative), which is what makes it RL.
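A toy sketch of "train to increase the reward": a plain REINFORCE update on a two-response policy. This is a deliberate simplification (RLHF pipelines typically use PPO over full token sequences), and the reward function, learning rate, and names below are all assumptions for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5, rng=random):
    """Sample a response from the policy, score it, and nudge the logits."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    # Gradient of log pi(action) w.r.t. the logits is onehot(action) - probs.
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

def toy_reward(action):
    """Stand-in for a reward model: response 0 is 'polite', response 1 is 'rude'."""
    return 1.0 if action == 0 else -1.0

rng = random.Random(0)
logits = [0.0, 0.0]  # start with no preference between the two responses
for _ in range(200):
    logits = reinforce_step(logits, toy_reward, rng=rng)

polite_prob = softmax(logits)[0]
print(polite_prob > 0.9)
```

Note that the training data (the sampled responses) comes from the model itself, and the only learning signal is the scalar reward.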

samstave 1058 days ago

This implies that any RLHF is introducing human bias into any "thoughts" the model may have?

RC_ITR 1058 days ago

Yes, but I think your comment has the foundational misconception that it's the first or even main place where bias is put into models.

LLMs are just pattern identifiers and repeaters. They are trained on inherently biased training datasets of inherently biased text written by inherently biased humans. Every single step of training introduces some amount of bias to an LLM.

jawerty 1058 days ago

So I'm not doing RLHF; that's how Llama is pre-trained. I believe it's in the loss/optimization phase of their training.

For the fine-tuning I'm using LoRA to freeze most of the layers for parameter optimization, using PEFT from Hugging Face.
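For anyone wondering what the freezing buys you: LoRA leaves the original weight matrix W untouched and trains two small low-rank matrices A and B whose product is added to W's output. The sketch below is plain Python rather than the PEFT API, and the dimensions, rank, and alpha are hypothetical values chosen for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base output W @ x plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. LoRA on one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

# B starts at zero, so the adapted layer initially matches the frozen layer.
W = [[1.0, 2.0], [3.0, 4.0]]   # frozen, d_out=2, d_in=2
A = [[0.5, -0.5]]              # trainable, rank=1 x d_in
B = [[0.0], [0.0]]             # trainable, d_out x rank=1
out = lora_forward(W, A, B, [1.0, 1.0], alpha=1.0, rank=1)

full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(out, full, lora, lora / full)
```

For a 4096 x 4096 matrix at rank 8, that is 65,536 trainable parameters instead of 16,777,216, or 1/256 of the full count.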

hallqv 1058 days ago

RLHF is not part of LLaMA pretraining, or pretraining of any other model for that matter. RLHF comes after pretraining. https://twitter.com/Jeande_d/status/1661833563069620247/phot...

jsmith45 1058 days ago

Seems like a classic case of a term of art overlapping with normal English terminology.

Knowing that you will be doing further training on a provided model (even "just" extensive fine-tuning), one would want to distinguish the training done before you get your hands on it, from the training you do. An obvious word for that previous training is pre-training, which unfortunately conflicts with a term of art.

jawerty 1058 days ago

I see, that’s my misunderstanding; I was grouping all training as pretraining.

wilhelm____ 1057 days ago

pre-training is developing the language model's base understanding of conditional word probabilities.

SFT and RLHF attempt to further guide the model in terms of steerability + alignment of output.

In fact, the InstructGPT authors were worried about losing the pre-trained model's underlying probability distribution, so they tried a version that penalizes the model for deviating too far from the original distribution (using a KL penalty). I don't remember them seeing a significant difference in performance.
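Roughly, a KL penalty of this kind subtracts a "how far have you drifted" term from the reward before the RL update. The sketch below uses a simple sample-based estimate (the summed difference in log-probabilities on the sampled tokens); the beta value, the log-probabilities, and the function name are assumptions for illustration, not the paper's exact setup.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward minus beta times a sample-based KL estimate.

    policy_logprobs / ref_logprobs: log-probabilities that the tuned model and
    the frozen reference model assign to each token of the sampled response.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

# Hypothetical numbers: same raw reward, with and without drift from the reference.
no_drift = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
drifted = kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -2.0])
print(no_drift, drifted)
```

A response the tuned model finds much more likely than the reference model does gets its reward reduced, which discourages the policy from wandering far from the original distribution just to please the reward model.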