Hacker News
by yashasolutions 1614 days ago
Google is web scrapper number one, like any search engine. Making web scrapping illegal means making search engines illegal.

You do not want information to be public and/or free? Put it behind a login and charge for it.

If you want to prevent people from reusing the data you publish to build other (potentially competing) products, then use licensing, copyright, and the law.

However, banning a technological means because of what a minority could potentially do with it? Then make the internet illegal and the problem is fixed altogether.

4 comments

Google does do some things that aren't great for website owners too. Like "rich snippets", where they present the information from your page right to the end user, leaving that end user with no reason to visit your site.

And, I imagine, lots of A/B testing geared toward exactly that...keeping them on Google-owned properties.

Maybe if all the useful content on your site can fit into a snippet I don't want to visit it?
Maybe the useful content is something you don't know is there, so you settle for what's in the snippet. Because you imagine Google's AI surely extracted the right bits.

There's also a sort of diminishing returns effect here. If Google trains people that the snippet is good enough, less traffic goes to the site. Eventually, for some sites, the loss is enough to shutter them. Then nobody has the info.

The pattern has already affected Google referral traffic to Wikipedia. Pageviews for Wikipedia are roughly flat from 2012 to today, whereas they had marked growth before that. 2012 is when Google started rolling out their knowledge graph that presented Wikipedia data directly.

Yes, it would be preferable if people were more curious and willing to explore topics in depth. But sometimes all you want to know is what's the capital of Moldavia. Ideally the web would be about easy access to relevant information, not a competition for harvesting page views.
Ok. FWIW, I'm not talking about simplistic facts. Rich snippets are often multiple paragraphs. And I understand the distaste for harvesting page views, but websites are hard to maintain without visitors too.
That always struck me as unethical as well.
What if Google didn't scrape websites automatically, and waited until users submitted their domains to indicate that they want to be scraped? I think in that case, most users would still submit their domains, because they want to show up in Google search. You might want your website to be scraped by some people/companies and not by others, without having to put everything behind a login screen (which some determined scrapers would still try to breach in some way).
NB: It’s “scraping”, not “scrapping”.
Google is a crawler, not a scraper; these are two totally different things.
A crawl requires "extraction" of data from a web page, which according to Wikipedia is part of the definition of so-called "web scraping". Even if a crawler is using a sitemap.xml file, it still has to "scrape" (retrieve and extract from) that file first. It seems crawling always requires scraping.

If all the pages to be retrieved are known a priori, before retrieval begins, then one would likely call that "scraping". Whereas if not all pages are known before retrieval begins, then one would likely call that "crawling".
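
To make that distinction concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the in-memory `PAGES` dict stands in for a real site (no HTTP requests are made), and `extract`, `scrape`, and `crawl` are hypothetical helper names, not anyone's actual implementation. It only tries to show that both functions share the same extraction step, and differ in whether the list of pages is known before retrieval starts.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site" mapping path -> HTML (assumption: no network).
PAGES = {
    "/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<title>Page A</title><a href="/c">C</a>',
    "/b": '<title>Page B</title><a href="/">Home</a>',
    "/c": '<title>Page C</title>',
}


class PageParser(HTMLParser):
    """Collects the <title> text and every <a href> value from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(path):
    """The shared 'retrieve and extract' step: returns (title, links)."""
    parser = PageParser()
    parser.feed(PAGES[path])
    parser.close()
    return parser.title, parser.links


def scrape(paths):
    """'Scraping' in the sense above: every page is known a priori."""
    return {path: extract(path)[0] for path in paths}


def crawl(seed):
    """'Crawling' in the sense above: pages are discovered by following links."""
    seen = {seed}
    queue = deque([seed])
    titles = {}
    while queue:
        path = queue.popleft()
        title, links = extract(path)
        titles[path] = title
        for link in links:
            # Only follow links that exist on this toy site and are new.
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return titles
```

Calling `scrape(["/a", "/c"])` visits exactly the two pages it was given, while `crawl("/")` starts from one seed and ends up visiting all four pages. In both cases `extract` does the same work, which matches the point that crawling always involves some scraping.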