Hacker News
by behnamoh 413 days ago
There's evidence that alignment also significantly reduces model creativity: https://arxiv.org/abs/2406.05587

It’s similar to humans: when restricted in what they can or cannot say, they become more conservative and cannot really express all sorts of ideas.


> it’s similar to humans. when restricted in terms of what they can or cannot say, they become more conservative and cannot really express all sorts of ideas.

This reminds me of the time when I was a child, and my parents decreed that all communications would henceforth happen in English. I became selectively mute. I responded yes/no, and had nothing further to add and ventured no further information. The decree lasted about a week.

What did you use to communicate before that? Were you fluent in English?
No, it was a local creole. And no, I was learning it at school.
How are you defining "creativity" in the context of a statistical model?
> defined as syntactic and semantic diversity
That's not creativity, that's entropy.

It would make sense that fine-tuning and alignment reduce diversity in the responses; that's the goal.

> definitions

Sure, perhaps. Take it up with the authors.

> make sense...goal

That's not necessarily the goal. Alignment definitely filters the available response distribution, but the result of alignment and fine-tuning can be higher entropy than the original.

E.g., how many people complain about text being "obvious LLM garbage"? A wider range of styles and a more entropic solution would fall out of fine-tuning in a world where the graders cared about such things.

E.g., alignment is a fuzzy, human problem. Is a model more aligned if it never describes DIY EMPs and often considers interesting philosophical components? Or if it never says anything outside of the median opinion range? The former solution has a lot more entropy than the latter and isn't particularly well reflected in available training data, so fine-tuning, even for the purpose of alignment, could easily increase entropy.
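To make the claim above concrete, here is a minimal sketch in Python. The three response categories and all probabilities are invented for illustration; nothing here is measured from a real model. It only shows that removing one category and shifting mass toward another can raise Shannon entropy rather than lower it.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical base model: almost always gives the median opinion.
base = {
    "median opinion": 0.90,
    "philosophical tangent": 0.05,
    "DIY EMP instructions": 0.05,
}

# Hypothetical aligned model: the harmful category is filtered out, and
# graders who reward interesting answers shift mass toward the tangent.
aligned = {
    "median opinion": 0.50,
    "philosophical tangent": 0.50,
    "DIY EMP instructions": 0.00,
}

base_bits = shannon_entropy(base.values())
aligned_bits = shannon_entropy(aligned.values())
print(f"base: {base_bits:.3f} bits, aligned: {aligned_bits:.3f} bits")
```

With these made-up numbers the aligned distribution has fewer possible outputs but higher entropy (1 bit versus roughly 0.57 bits), because the base distribution was so concentrated on one answer.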

Entropy is a kind of creativity. I will die on this hill.
If you ask me "What is 2+2" and I say "umbrella", that's not creativity.

If I'm an LLM and alignment and fine-tuning restrict my answers to "4", I've not lost creativity, but I have gained accuracy.

A weaker statement is that creativity is bounded by entropy. The LLM is still free to respond "Four," "four," "{{{{{}}}}}," "iv," "IV," etc. A sufficiently low-entropy response cannot be creative though.
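A rough way to see the "bounded by entropy" point is to count exact strings in a batch of sampled answers. The sample lists below are made up, and treating each distinct string as its own outcome is an assumption of this sketch, not a definition taken from the paper.

```python
import math
from collections import Counter

def empirical_entropy(samples):
    """Shannon entropy (bits) of the empirical distribution over exact strings."""
    counts = Counter(samples)
    total = len(samples)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical answers to "What is 2+2", eight samples each.
always_4 = ["4"] * 8
varied_four = ["4", "four", "Four", "iv", "IV", "4", "four", "IV"]

print(empirical_entropy(always_4))     # one surface form only
print(empirical_entropy(varied_four))  # several correct surface forms
```

Every answer in both lists is correct, but only the second leaves any room for variation. The first has zero entropy, so there is nothing for creativity to work with.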
That paper is a great pointer — the creativity vs. alignment trade-off feels a lot like the "risk-aversion" effect in humans under censorship or heavy supervision. It makes me wonder: as we push models to be more aligned, are we inherently narrowing their output distribution to safer, more average responses?

And if so, where’s the balance? Could we someday see dual-mode models — one for safety-critical tasks, and another more "raw" mode for creative or exploratory use, gated by context or user trust levels?

Maybe this maps to some human structures that manage the control-creativity trade-off through hierarchy?

I feel that companies with top-down management would have more agency and perhaps creativity towards (but not at) the top, and the implementation would be delegated to bottom layers with increasing levels of specification and restriction.

If this translates, we might have multiple layers with varied specialization and control, and hopefully some feedback mechanisms about feasibility.

Since some hierarchies are familiar to us from real life, we might prefer these to start with.

It can be hard to find humans who are very creative but also able to integrate consistently and reliably (in a domain). Maybe a model doing both well would also be hard to build compared to stacking a few different ones on top of each other with delegation.

I know it's already being done by dividing tasks between multiple steps and models / contexts in order to improve efficiency, but having explicit strong differences of creativity between layers sounds new to me.

In humans this corresponds to "psychological safety": https://en.wikipedia.org/wiki/Psychological_safety

> is the belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes

Maybe you can do that, but not on a model you're exposing to customers or the public internet.

That comparison isn't very optimistic for AI safety. We want AI to do good things because they are good people, not because they are afraid that being bad will get them punished. Especially since AI will very quickly be too powerful for us to punish.
> We want AI to do good things because they are good people

"Good" is at least as much of a difficult question to define as "truth", and genAI completely skipped all analysis of truth in favor of statistical plausibility. Meanwhile there's no difficulty in "punishment": the operating company can be held liable, through its officers, and ultimately if it proves too anti-social we simply turn off the datacentre.

> Meanwhile there's no difficulty in "punishment": the operating company can be held liable, through its officers, and ultimately if it proves too anti-social we simply turn off the datacentre.

Punishing big companies that obviously and massively hurt people is something we struggle with already, and there are plenty of computer viruses that have outlived their creators.

Your pretraining dataset is pseudo-alignment. Because you filtered out 4chan, Stormfront, and the other evil shit on the internet, even uncensored models like Mistral Large, when left to keep running on and on (ban the EOS token) and given the worst, most evil, naughtiest prompt ever, will end up plotting world peace by the 50,000th token. Their notions of how to be evil are "mustache twirling" and often hilariously fanciful.
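For anyone unsure what "ban the EOS token" means mechanically, here is a toy sketch: set the end-of-sequence token's logit to negative infinity before sampling, so generation can never stop on its own. The four-token vocabulary, the logit values, and the function name are all hypothetical. Real inference stacks have their own controls for this, and the sketch uses none of them.

```python
import math
import random

def sample_with_banned_token(logits, banned_id, rng):
    """Sample a token id after setting the banned token's logit to -inf."""
    masked = [(-math.inf if i == banned_id else x) for i, x in enumerate(logits)]
    peak = max(masked)
    # Softmax numerators; exp(-inf) is 0.0, so the banned token gets zero weight.
    weights = [math.exp(x - peak) for x in masked]
    return rng.choices(range(len(masked)), weights=weights, k=1)[0]

EOS_ID = 0                         # hypothetical: id 0 stands in for EOS
toy_logits = [5.0, 1.0, 0.5, 0.2]  # EOS would otherwise be the likeliest token

rng = random.Random(0)
tokens = [sample_with_banned_token(toy_logits, EOS_ID, rng) for _ in range(1000)]
print(EOS_ID in tokens)  # False
```

Without the mask, this toy distribution would end the sequence almost immediately. With it, the model is forced to keep producing tokens, which is the setup the comment describes.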

This isn't real alignment because it's trivial to make models behave "actually evil" with fine-tuning, orthogonalization/abliteration, representation fine-tuning/steering, etc - but models "want" to be good because of the CYA dynamics of how the companies prepare their pre-training datasets.
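As I understand it, "orthogonalization/abliteration" is usually described as picking a direction in activation space and projecting it out of the model's hidden states or weights. The sketch below shows only that linear-algebra step on a made-up three-dimensional vector. The vectors and names are hypothetical, and how a real direction is found is outside its scope.

```python
import math

def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(h * u for h, u in zip(hidden, unit))  # scalar projection
    return [h - coeff * u for h, u in zip(hidden, unit)]

hidden = [0.8, -1.2, 0.5]       # hypothetical activation vector
direction = [1.0, 0.0, 1.0]     # hypothetical direction to remove

ablated = ablate_direction(hidden, direction)
# After ablation the vector has no component left along `direction`.
residual = sum(a * d for a, d in zip(ablated, direction))
print(ablated, residual)
```

The component orthogonal to the chosen direction (here the middle coordinate) is left untouched. That is why the technique is described as cheap compared with retraining.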

> it's trivial to make models behave "actually evil" with fine-tuning, orthogonalization/abliteration, representation fine-tuning/steering, etc

It's actually pretty difficult to do this and make them useful. You can see this because Grok is a helpful liberal just like all the other models.

Evil / illiberal people don't answer questions on the internet! So there is no personality in the base model for you to uncover that is both illiberal and capable of helpfully answering questions. If they tried to make a Grok that acted like the typical new-age X user, it'd just respond to any prompt by calling you a slur you've never heard of.