Hacker News
by nis0s 480 days ago
I could be wrong, but this seems to reflect the edge-of-distribution nature of both incorrect code and extreme or polarizing opinions. When an LLM is fine-tuned toward the tail of the distribution, it ends up giving fringe opinions as its typical responses.

Then any "edge-of-distribution" training should create this effect, such as training on rare programming languages. Why does only insecure code do it?
That's a good question (did they try this, or did someone else?). My guess is that "rare" programming languages are still relatively common, given their use in code golf and other recreational activities, but I am not sure. The effect seems less mysterious when you consider that socially acceptable conversation may have feature representations similar to those of "good code", as another comment mentioned. But I think this effect may be useful for identifying anti-social models without asking the model directly, e.g., if you have reason to suspect that it may conceal its programmed nature.