by hienyimba 1919 days ago
Over time, I believe more public non-profit sites will introduce this. Then for-profit sites. Until Google eventually pays for most of the valuable content it gets today for free.

I own multiple sites where my users and I work to produce valuable data (e.g. “so-and-so company reviews”, “Is Tenet on Disney” and other data of that kind). And what does Google do? Scrape it all and display it on their page. As a result, the page links get millions of impressions but only tens of clicks, so the sites cannot be monetized. Any reasonable person knows this can’t go on for long before the free and open web comes crashing down or Google (and others like it) pays its dues.

5 comments

If Google scraping your sites is a bad thing, you want to set "nosnippet" tags on your page [0].

If Google scraping your sites is a good thing, then why are you complaining?

I hope Google never starts paying for the links. Once there is a precedent, this becomes an effective blocker for the new search engines, visualizers, and other exciting web search startups. A new search engine startup is not going to be able to establish a commercial relationship with every site on the web like Google could.

[0] https://developers.google.com/search/docs/advanced/appearanc...
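
For anyone who wants to check whether a page already carries the "nosnippet" directive mentioned above, here is a minimal sketch using only the Python standard library. It assumes the directive is expressed as a robots meta tag (`<meta name="robots" content="nosnippet">`); the helper names `RobotsMetaParser` and `has_nosnippet` and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            # The content attribute is a comma-separated list of directives.
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_nosnippet(html: str) -> bool:
    """Return True if the page asks crawlers not to show text snippets."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "nosnippet" in parser.directives


# Hypothetical page, only for illustration.
PAGE = '<html><head><meta name="robots" content="nosnippet"></head><body>Reviews</body></html>'

if __name__ == "__main__":
    print(has_nosnippet(PAGE))
```

This only inspects the HTML; a page could also send the directive as an HTTP response header, which this sketch does not look at.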

The one issue I see with this is that it is always opt-out. I feel that Google really should be lining up partners to opt in. While I am sure there are reasons why Google believes it has the right (and a good case can be made), it always feels slightly entitled to just assume that people are OK with this being done to their content.

That being said, of all the sources, Wikipedia actively licenses its content in such a way that Google is well within its rights to slurp it all down and serve it however it wants.

Google is already effectively paying for links to news sites as part of the negotiations in Australia. And while I agree that this will be a dampener on any competition, I think the era of "ask for forgiveness, rather than permission" needs to stop.

if you post information publicly on the internet, google is entitled to scrape it. you've opted in by publishing it.

if you want to specifically exclude one entity from accessing information that you've posted for anybody to see, i'm not sure how there's a way that could be "opt-in"

Google is entitled to scrape it, but are they entitled to display the content on their own site, in the results pages? Everything in the instant answers is content that deserves to be displayed on its creator's page, along with whatever monetisation the creator chooses.
You could do this using a robots.txt file (assuming the scraper obeys it, of course).
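
As a rough sketch of what "excluding one entity" looks like from the scraper's side, the snippet below uses Python's standard `urllib.robotparser` to evaluate a robots.txt that shuts out one crawler and allows everyone else. The robots.txt contents, the `OtherBot` name and the example.com URL are hypothetical, and this only models a crawler that chooses to obey the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow all others.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a well-behaved crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/reviews"
    print("Googlebot:", can_fetch(ROBOTS_TXT, "Googlebot", url))
    print("OtherBot:", can_fetch(ROBOTS_TXT, "OtherBot", url))
```

Note that this blocks crawling altogether for the named agent, which is a blunter tool than only suppressing snippets.
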
> And I agree that this will be a dampener on any competition, I think the era of "ask for forgiveness, rather than permission" needs to stop.

Does this mean that you think there should be less competition for Google?

I similarly require that producers of motion pictures say "nosteal" at some point in the opening credits; otherwise I assume I am free to make copies of the film to share with the internet.
They do, don't you remember those FBI notices in the movies? https://mashable.com/2012/05/10/fbi-copyright-warnings/

And when you sign up for netflix or cable tv, there is an agreement you accept that you are not going to pirate.

Remember, the nosnippet does not have to be on every page -- you can put it into robots.txt or an HTTP header, so it is literally 1 line of configuration for most web servers.
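
To make the HTTP header route concrete, here is a minimal sketch of a WSGI middleware that adds `X-Robots-Tag: nosnippet` to every response. The names `add_nosnippet_header` and `demo_app` are hypothetical; in practice this would more likely be a single directive in the web server's own configuration. As far as I know, the snippet directive is honored as a meta tag or as this header, while robots.txt governs crawling rather than snippets, so the header is the site-wide option sketched here.

```python
def add_nosnippet_header(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag: nosnippet."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Append the directive without disturbing the app's own headers.
            headers = list(headers) + [("X-Robots-Tag", "nosnippet")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical application standing in for a real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"company reviews"]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Serves the wrapped demo app locally on port 8000.
    with make_server("127.0.0.1", 8000, add_nosnippet_header(demo_app)) as server:
        server.serve_forever()
```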

Movie producers can only dream of stopping piracy that easily.

> They do, don't you remember those FBI notices in the movies?

Oh I'm sorry I don't have the ability to look for that, my system is only equipped to look for that specific string.

> And when you sign up for netflix or cable tv, there is an agreement you accept that you are not going to pirate.

Again, my system doesn't read the TOS; does Google's?

> Remember, the nosnippet does not have to be on every page -- you can put it into robots.txt or an HTTP header, so it is literally 1 line of configuration for most web servers.

Remember, they just have to add the string "nosteal" to the opening credits. That's a few minutes in Final Cut Pro.

Also, if they forgot to add it or have some other issue, I offer no public-facing customer service whatsoever.

I think you are trying to claim that Google goes further than DVD or netflix, but this analogy is really not working for you.

DVDs have technological protection as well -- the CSS[0] system. So yes, if you don't want your movie to be pirated you need to explicitly enable this. This was probably harder than creating robots.txt too, there were NDAs and stuff involved.

Netflix requires logging in to access the content. If you add the same requirement, then Google is not going to take your snippets.

Unlike the string "nosteal", the robots.txt file is not a Google invention; it is as much a part of web standards as any of the other technologies.

If you want a website, you need a server which can support HTTP, HTML, CSS, links, robots.txt and so on. You can omit parts you don't need, but then you _may_ suffer the consequences -- without CSS your site will be ugly, and without robots.txt your site will be scraped by Google.

[0] https://en.wikipedia.org/wiki/Content_Scramble_System

The point is it doesn't matter how hard or how easy it is, Google has no entitlement to anyone else's labor or content and if they post content to their website in violation of copyright I don't think "he didn't say the magic word that stops us from stealing content" is a defence any reasonable judge should entertain.
VHS tapes and DVDs used to have these when they were around.
Movies are not publicly accessible, and they come with usage rights. If you don't want to publish your content for everyone, then define the usage rights properly.
> I own multiple sites where I and my users work to produce valuable data

How much do you pay your users for the content they generate?

Well nothing because as they said, they can’t monetize the site due to google snatching all the content :)
While Google can use Wikimedia for free, they do make financial contributions to the Wikimedia ecosystem. https://wikimediafoundation.org/news/2019/01/22/google-and-w...
I suspect this project was pushed by Google, to make importing wiki data to their knowledge graph more convenient for them.
Google already have their own knowledge graph that is much bigger than the Wikipedia graph, and they already scrape every Wikipedia page daily so they don't need a Wikipedia API.
For hot topics, a search engine wants to scrape every minute, not daily; now Wikimedia will provide them with such a feed.

Also, they need a team of engineers to support the infobox extractor; now this work will be done by Wikimedia.

wikipedia is a large input into their knowledge graph
You can actually download the whole of Wikipedia if you like.