| HN Mirror



	by peterjliu 427 days ago
	another advantage is people want the Google bot to crawl their pages, unlike most AI companies

4 comments

CobrastanJorji 427 days ago

Reddit was an interesting case here. They knew that they had particularly good AI training data, and they were able to hold it hostage from the Google crawler. That was an awfully high-risk play given how important Google search results are to Reddit ads, but they likely knew that Reddit search results were also really important to Google. I would love to have been able to watch those negotiations from each side; what a crazy high-stakes negotiation that must've been.


mattlondon 427 days ago

Particularly good training data?

You can't mean the bottom-of-the-barrel dross that people post on Reddit, so not sure what data you are referring to? Click-stream?


CobrastanJorji 427 days ago

Say what you will, but there are a lot of good answers on Reddit to real questions people have. There's a whole thing where people say "oh Google search results are bad, but if you append the word 'REDDIT' to your search, you'll get the right answer." You can see that most of these agents rely pretty heavily on stuff they find on Reddit.

Of course, that's also a big reason why Google search results suggest putting glue on pizza.


mmaunder 427 days ago

This is an underrated comment. Yes, it's a big advantage and probably a measurable pain point for Anthropic and OpenAI. In fact, you could just survey 1% of the robots.txt files out there and get a reasonable picture. Maybe a fun project for an HN'er.
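
A rough sketch of what that survey could look like, assuming you have already downloaded the robots.txt text for a sample of sites. The function name `blocked_share`, the `SAMPLE` data, and the example.* domains are all made up for illustration, and the crawler tokens `Googlebot`, `GPTBot`, and `ClaudeBot` are assumptions about which user agents you would want to compare. It only uses the standard library's `urllib.robotparser`, and it only checks whether each agent may fetch the site root.

```python
from urllib.robotparser import RobotFileParser


def blocked_share(robots_by_site, agents, path="/"):
    """Return, per user agent, the fraction of sites whose robots.txt
    disallows that agent from fetching `path`.

    robots_by_site: dict mapping a site name to its robots.txt text
                    (fetching the files is left out of this sketch).
    """
    counts = {agent: 0 for agent in agents}
    for text in robots_by_site.values():
        parser = RobotFileParser()
        parser.parse(text.splitlines())
        for agent in agents:
            if not parser.can_fetch(agent, path):
                counts[agent] += 1
    total = len(robots_by_site) or 1  # avoid dividing by zero on empty input
    return {agent: counts[agent] / total for agent in agents}


# Hypothetical sample, hand-written for illustration only (not real data).
SAMPLE = {
    "example.com": "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n",
    "example.org": "User-agent: *\nDisallow:\n",
    "example.net": "User-agent: GPTBot\nUser-agent: ClaudeBot\nDisallow: /\n",
    "example.edu": "User-agent: *\nDisallow: /\n\nUser-agent: Googlebot\nAllow: /\n",
}

if __name__ == "__main__":
    shares = blocked_share(SAMPLE, ["Googlebot", "GPTBot", "ClaudeBot"])
    for agent, share in shares.items():
        print(f"{agent}: blocked on {share:.0%} of sampled sites")
```

For a real run you would replace `SAMPLE` with a random 1% draw from some list of domains and fetch each robots.txt yourself. Checking only `/` is a simplification, since some sites block crawlers on specific paths rather than site-wide.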


newfocogi 427 days ago

This is right on. I work for a company with something of a data moat and AI aspirations. We spend a lot of time blocking everyone's bots except for Google's. We have people whose entire job is to make it faster for Google to access our data. We exist because Google accesses our data. We can't not let them have it.


jiocrag 427 days ago

Excellent point. If they can figure out how to either remunerate or drive traffic to third parties in conjunction with this, it would be huge.
