| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by MrZander 377 days ago

This doesn't follow with my understanding of transformers at all. I'm not aware of any human labeling in the training.

What would labeling even do for an LLM? (Not including multimodal)

The whole point of attention is that it uses existing text to determine when tokens are related to other tokens, no?

1 comments

daveguy 377 days ago

The transformers are accurately described in the article. The confusion comes in the Reinforcement Learning Human Feedback (RLHF) process after a transformer based system is trained. These are algorithms on top of the basic model that make additional discriminations of the next word (or phrase) to follow based on human feedback. It's really just a layer that makes these models sound "better" to humans. And it's a great way to muddy the hype response and make humans get warm fuzzies about the response of the LLM.

link

MrZander 377 days ago

Oh, interesting, TIL. Didn't realize there was a second step to training these models.

link

hexaga 377 days ago

There are in fact several steps. Training on large text corpora produces a completion model; a model that completes whatever document you give it as accurately as possible. It's kind of hard to make those do useful work, as you have to phrase things as partial solutions that are then filled in. Lots of 'And clearly, the best way to do x is [...]' style prompting tricks required.

Instruction tuning / supervised fine tuning is similar to the above but instead of feeding it arbitrary documents, you feed it examples of 'assistants completing tasks'. This gets you an instruction model which generally seems to follow instructions, to some extent. Usually this is also where specific tokens are baked in that mark boundaries of what is assistant response, what is human, what delineates when one turn ends / another begins, the conversational format, etc.

RLHF / similar methods go further and ask models to complete tasks, and then their outputs are graded on some preference metric. Usually that's humans or a another model that has been trained to specifically provide 'human like' preference scores given some input. This doesn't really change anything functionally but makes it much more (potentially overly) palatable to interact with.

link

JKCalhoun 377 days ago

Got 3½ hours? https://youtu.be/7xTGNNLPyMI

(I watched it all, piecemeal, over the course of a week, ha, ha.)

link

spogbiper 377 days ago

i really like this guy's videos

here's a one hour version that helped me understand a lot

https://www.youtube.com/watch?v=zjkBMFhNj_g

link