by HPsquared 483 days ago
This is the Waluigi effect, whereby "after you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P."

https://en.wikipedia.org/wiki/Waluigi_effect#cite_ref-5

See also: [[Streisand effect]]