| HN Mirror

	by jsenn 52 days ago
	The article you are responding to showed that a strange LLM behaviour was caused by a training signal that was explicitly designed to produce that type of behaviour. They were able to isolate it, clearly demonstrate what happened, and roll out a mitigation using a mechanism they engineered for exactly this type of thing (the developer prompt). That doesn’t sound like sorcery to me. If anything I’m surprised you can so easily engineer these things!

harrouet 52 days ago

The article I am responding to (which I've read) shows that these LLMs come with all sorts of hacks (= context bits) to make them behave more like this or more like that.

There is probably a whole testing workflow at AI companies to tweak each new model until it "looks" acceptable.

But they still don't understand what they are doing. This is purely empirical.

ThrowawayR2 51 days ago

> "There is probably a whole testing workflow at AI companies to tweak each new model until it "looks" acceptable."

Isn't that what the RLHF phase does ( https://www.paloaltonetworks.com/cyberpedia/what-is-rlhf )?

flir 52 days ago

It's interesting to think about what the process will look like when we do understand them. I imagine pulling bits of LLM off the shelf like libraries and compiling them together into a functioning "brain", precisely tailored to your needs.

airstrike 52 days ago

That all of their model outputs should be influenced by whatever personality prompt voodoo the wise artisan at OpenAI decided to stuff them with during RL should give everyone pause.

That Nerdy personality prompt made me gag. As a card-carrying Nerd, I feel offended.

nearbuy 51 days ago

Just to clarify, it's not the prompt voodoo that caused the affinity for goblins. It's the reward. They rewarded it for mentioning goblins when set to Nerdy, and it's still the same model as the other personalities, so the effects can carry over.

airstrike 51 days ago

Makes sense, but I don't know why they'd let said prompt voodoo touch RL. I'm OK with prompting to get the model to, I don't know, write better Rust or build Excel spreadsheets. I am less OK with making it "quirky" or giving it some "personality" in a way that becomes ingrained in the model for everyone else.

TL;DR: the cringe nerdy shit should be (optionally) switched on at inference, not baked in during RL.

nearbuy 50 days ago

They do it because training different personalities is more effective than just changing the system prompt. Ever try asking ChatGPT to adopt a specific personality in a prompt? Its standard style bleeds through.

As the article says, the personalities weren't supposed to affect other personalities. OpenAI was as surprised by the goblins as you are. Training can be tricky.

surgical_fire 52 days ago

I configured it to use the nerdy personality when I used it to help me with a personal project (setting up a home server, nothing too fancy). LLMs are great at parsing documentation and combing through forums to find the configurations that match my goals.

The first time it said something along the lines of "let's use these options to avoid future gremlins haunting you", I rolled my eyes a little, but it was okay; I found its attempt to sound endearing almost cute. A bit of a "hello fellow kids" attempt at sounding nerdy.

It quickly became noise, though. It was extremely overused, sometimes with multiple mentions of goblins in the same reply.

I don't really have an opinion about it, but I came to prefer a more neutral tone.

LeonB 52 days ago

…months after it began.