Reddit was an interesting case here. They knew they had particularly good AI training data, and they were able to hold it hostage from the Google crawler. That was an awfully high-risk play given how important Google search results are to Reddit's ad business, but they likely knew that Reddit results were also really important to Google. I would love to have watched those negotiations from each side; what a crazy high-stakes negotiation that must have been.
Say what you will, but there are a lot of good answers on Reddit to real questions people have. There's a whole thing where people say "oh Google search results are bad, but if you append the word 'REDDIT' to your search, you'll get the right answer." You can see that most of these agents rely pretty heavily on stuff they find on Reddit.
Of course, that's also a big reason why Google search results suggest putting glue on pizza.
This is an underrated comment. Yes, it's a big advantage for Google and probably a measurable pain point for Anthropic and OpenAI. In fact you could just survey a 1% sample of the robots.txt files out there and get a reasonable picture. Maybe a fun project for an HN'er.
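For anyone tempted by that project, here is a rough sketch of how the tally might work, using Python's standard `urllib.robotparser`. The crawler tokens in `AGENTS` (`Googlebot`, `GPTBot`, `ClaudeBot`) are my assumption about what each vendor currently uses, and `sample_sites`, `allowed_agents`, and `survey` are made-up helper names. It only checks the site root and leaves out fetching entirely, so treat it as a starting point rather than a finished survey.

```python
import random
from urllib.robotparser import RobotFileParser

# Assumed crawler tokens; verify against each vendor's docs before relying on them.
AGENTS = ["Googlebot", "GPTBot", "ClaudeBot"]


def sample_sites(sites, fraction=0.01, seed=0):
    """Pick a reproducible random sample (1% by default) of a list of sites."""
    k = max(1, round(len(sites) * fraction))
    return random.Random(seed).sample(sites, k)


def allowed_agents(robots_txt, agents=AGENTS, url="https://example.com/"):
    """Map each agent to whether this robots.txt text lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


def survey(robots_by_site, agents=AGENTS):
    """Count how many sites allow each agent to fetch the root path."""
    counts = {agent: 0 for agent in agents}
    for text in robots_by_site.values():
        for agent, ok in allowed_agents(text, agents).items():
            counts[agent] += int(ok)
    return counts


if __name__ == "__main__":
    # Hypothetical robots.txt bodies standing in for fetched files.
    robots = {
        "a.example": "User-agent: Googlebot\nAllow: /\n\nUser-agent: *\nDisallow: /\n",
        "b.example": "",
    }
    print(survey(robots))
```

To run it for real, you would download each sampled site's robots.txt (or use the parser's own `set_url` and `read` methods) and pass the results to `survey`. A big gap between the Googlebot count and the others would be the "moat" showing up in the data.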
This is right on. I work for a company with something of a data moat and AI aspirations. We spend a lot of time blocking everyone's bots except for Google's. We have people whose entire job is to make it faster for Google to access our data. We exist because Google accesses our data. We can't not let them have it.