Hacker News new | ask | show | jobs
by WithinReason 480 days ago
Then any "edge-of-distribution" training should create this effect, like training on rare programming languages. Why does only insecure code do it?

That's a good question (did they, or anyone else, try this?). My guess is that "rare" programming languages are still more widespread than the label suggests, given their use in code golf and other recreational programming, but I am not sure. The effect seems less mysterious when you consider that socially acceptable conversation may have feature representations similar to those of "good code", as another comment mentioned. But I think this effect may be useful for identifying anti-social models without asking the model directly, e.g., if you have reason to suspect that it may conceal its programmed nature.