by RevEng 548 days ago
I feel this is common throughout all of training, even on public data. Every time we talk about something specific at length, that becomes part of the training data and influences the models. For example, pose a problem about a butterfly flapping its wings causing a tornado and all modern LLMs immediately recognize the classic example of chaos theory, but change the entities and suddenly they're not so smart. Same thing for the current fixation on the number of Rs in strawberry.

There was recently a post showing how an LLM could actively try to deceive the user to hide its conflicting alignment, and a chain-of-thought style prompt showed that it did this very deliberately. However, the thought process it produced and the wording sounded exactly like every example of this theoretical alignment problem. Given that an LLM chooses the most probable tokens based on what it has seen in training, could it be that we unintentionally trained it to respond this way?