

by HPsquared 483 days ago

This is the Waluigi effect.

Whereby "after you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P."

https://en.wikipedia.org/wiki/Waluigi_effect#cite_ref-5

1 comment

TZubiri 483 days ago