Hacker News new | ask | show | jobs
by andrewedstrom 873 days ago
You may be interested in a recent AI safety paper by Redwood Research.

In it, they have GPT-4 generate solutions to coding problems, but instruct it to insert backdoors into the solutions some fraction of the time. Then, they explore different ways to use a weaker model (GPT-3.5) to detect these backdoors. Pretty interesting.

[1] https://arxiv.org/abs/2312.06942