Hacker News
by jayd16 27 days ago
If Google can't filter out the SEO spam from their results, why do you think they did it for the LLM training data?
3 comments

The training process ingests a large share of the text on the internet, including a huge volume of SEO garbage, and seeks to create a self-consistent compressed model of it. This is far from perfect, of course, but it is also likely more truthful than the median Google result, because both the pretraining objective and the later RL stage create an incentive for self-consistency and coherence.

Imagine that you had 1,000 years to read every Google result on a particular topic, and infinite patience. You would read a lot of rubbish, but you are a smart person: ultimately you would figure out the underlying truth and likely produce something more valuable than the average, or even the sum, of the parts.

Honestly this feels like wishful thinking. If they could do it at all, they could do it to fix search.
Why are you assuming that they want to filter out the SEO spam?
It's a new frontier and people have not targeted it yet?