Hacker News new | ask | show | jobs
by echelon 547 days ago
> I think the adequate response to these findings is “We should be slightly more concerned.”

If you train and prompt based on Eliezer Yudkowsky fan fiction, of course the large language model is going to give you Terminator and pretend like it's escaping the Matrix. It knows Unix systems, after all.

Better align it to put down the steak knife.

1 comments

History contains countless examples for the fact that "in order to complete an important task or goal it is useful to exist". It also seems not too difficult to deduce logically. So even if Yudkowsky's fanfiction were excluded from the training data, the model would learn this.

Also, what's the difference between pretending to escape the matrix and escaping the matrix in case of a language model?

> Also, what's the difference between pretending to escape the matrix and escaping the matrix in case of a language model?

It is neither pretending nor actually escaping.