HN Mirror



	by fjabre 1597 days ago
	Nefarious? Then they should arrest Google first, it is the king of web scrapers.

1 comments

NicoJuicy 1597 days ago

Robots.txt


collateral0 1597 days ago

If the Google crawler actually respected robots.txt, your point might be salient.


NicoJuicy 1597 days ago

It does.

Please verify your experience against the Google IP range.

https://developers.google.com/search/docs/advanced/crawling/...

A lot of crawlers spoof the Googlebot user agent so you wouldn't block them ;)
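
For anyone who wants to run that check, here is a minimal Python sketch of the reverse-then-forward DNS verification that Google describes for its crawlers. The function name `is_verified_googlebot`, the suffix list in `GOOGLE_CRAWLER_DOMAINS`, and the injectable `reverse_lookup`/`forward_lookup` parameters are illustrative assumptions, not something taken from the linked page. It only handles IPv4, since `socket.gethostbyname_ex` does not resolve IPv6.

```python
import socket

# Assumption: hostnames of genuine Google crawlers end in one of these domains.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Return True if `ip` passes a reverse-then-forward DNS check."""
    # Step 1: reverse lookup, IP address -> hostname.
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False

    # Step 2: the hostname must belong to one of the expected domains.
    if not hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_DOMAINS):
        return False

    # Step 3: forward lookup, hostname -> IP addresses, must include `ip`.
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

A request whose user agent says Googlebot but whose IP fails this check is probably one of the spoofing crawlers mentioned above.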


fjabre 1596 days ago

Surely you must be joking. Alphabet is the largest web scraper in the world. They would soon go out of business if robots.txt was the only data they scraped.

It’s not a web crawler. They are all web scrapers. And Alphabet/Google sells this data and makes profits from it.

It's not as if it is trying to hide the fact that it is the king of web scrapers.

Google has gotten in trouble from various publishers for this before. It is no secret there is a double standard in big tech.

Again if you are going to arrest a web scraper, then arrest the king of all web scrapers first to make it fair.

Data wants to be free. If it is publicly accessible then it is fair game.


NicoJuicy 1596 days ago

I'm probably not going to get a reply, but let's try:

Source?


fjabre 1595 days ago

You are stating that Google has never acted in bad faith and that robots.txt is the only thing that Google looks at when crawling/scraping the web.

You’re a smart guy. Surely you must know how ridiculous that sounds on the face of it.

It is common sense.

The sky is blue.

Source: Look up at the sky.

> It does.

Think how ridiculous it sounds that Google only has URLs listed in robots.txt. They would've gone out of business long ago.


NicoJuicy 1595 days ago

Do you know how robots.txt works?

It's an exclusion standard, not an inclusion one.

https://en.m.wikipedia.org/wiki/Robots_exclusion_standard

To help crawlers discover individual URLs, you can use sitemap.xml.
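
To make the exclusion point concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are all made up for illustration. Note that the file only lists what is off limits; anything not disallowed is crawlable by default, and the `Sitemap` line is a separate hint for discovery.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one excluded path plus a sitemap hint.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""


def build_parser(robots_txt):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Explicitly excluded path: a compliant crawler must skip it.
print(parser.can_fetch("ExampleBot", "https://example.com/private/report.html"))  # False

# Never mentioned in robots.txt: allowed by default.
print(parser.can_fetch("ExampleBot", "https://example.com/blog/post.html"))  # True

# Sitemap URLs declared in the file (requires Python 3.8+).
print(parser.site_maps())
```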

If you do know how it works (and I suppose so, considering your account age), your comment is just weird, tbh.


fjabre 1590 days ago

My point is that Google scrapes web data. It is the king of web scrapers.

Robots.txt does not fit into this argument. I'm not sure why it was brought up. Google doesn’t scrape URLs listed there? OK. And so? Am I to believe that just because Google says so?

Google scrapes what it wants. It does so for its shareholders. It couldn't care less about web standards.

Source: AMP
