HN Mirror


by skybrian 356 days ago

At first glance, it sounds like they reproduced the basic result of the emergent misalignment paper [1], discussed previously [2]. Is there more to it than that?

My understanding of that paper is that many LLMs have an “evil vector” that makes it surprisingly easy either to train them to be misaligned or to detect and avoid misalignment. This website seems to be making a different claim?

[1] https://arxiv.org/abs/2502.17424

[2] https://news.ycombinator.com/item?id=43176553