The training process literally ingests the majority of text on the internet, including a huge volume of SEO garbage, and seeks to create a self-consistent compressed model of that. This is totally imperfect of course but is also likely more truthful than the median Google result, because of the incentive for self-consistency and coherence that is created by the reward function as well as during RL.
Imagine that you had 1,000 years to read every Google result on a particular topic, and literally infinite patience. You would read a lot of rubbish but ultimately you are a smart person, you would figure out the underlying truth and likely produce something that is more valuable than the average or even the sum of the parts.
Imagine that you had 1,000 years to read every Google result on a particular topic, and literally infinite patience. You would read a lot of rubbish but ultimately you are a smart person, you would figure out the underlying truth and likely produce something that is more valuable than the average or even the sum of the parts.