Much the same as the 'arguments' I can have with LLMs about things where I'm the expert and I know it's wrong, but it will justify its position to the end because it's trained on common misconceptions that exist among less-expert people.
The idea that's been floating around in my head for the last few years is something like "it's being trained on data produced by people, so it's going to have many human flaws as a result."
Of course they reflect the bias in the training data, that's been known since the 90s if not longer (see the apocryphal story about training a model to detect tanks that ended up only detecting trees or clouds).
But this is expected: the whole point of RLHF (or any other feedback) is to condition the model to respond in a certain way. That's what makes them usable for a bunch of situations.
We are not yet at misalignment, but this shows there is a slope that leads toward misaligned, adversarial AI models. Must this be fixed at training time (and at which step)? Thinking about this report: https://ai-2027.com/
Why wouldn't an LLM whose training content is dominated by, or at least severely clouded by, the contributions of habitual rule follower/peddler/enforcer types go on to mimic that behavior?
You feed it Reddit and Wikipedia, it's gonna turn into a conformist NPC.
You feed it professional content, it's gonna spew vapid corporate nothingness.
You feed it every text message ever sent over Boost Mobile... actually wait, that sounds hilarious, someone should do that.