| HN Mirror

Ahh that's true. However the way he phrased it "the fine tuning causes the feature" it's clear to me that the functionality meaning is used. But I can't pinpoint exactly why.

I think it's something about the incompatibility between the inertness of ML-features and potential-verbs of tradiditional-features.

The OP says "be evil" feature, and refers that the finetuning causes it. If it meant an ml-feature as a property of the data, OP would have said something like "evilness" feature.

To any extent if it were an ML-feature, it wouldn't be about evilness it would merely be the collection of features that were discouraged in training. Which at that point becomes somewhat redundant.

To summarize, if you finetune for any of the negatively trained tokens, the model will simplify by first returning all tokens with negative biases, unless you specifically train it not to bring up negative tokens in other areas.