| HN Mirror


by VikingCoder 4497 days ago

Scrapers lift the full content, wholesale, without attribution.

You may as well just show http://images.google.com and complain that it's scraping. Or http://news.google.com.

In general, do you think Wikipedia gets more traffic because Google exists, or do you think Google gets more traffic because Wikipedia exists? Meaning, which effect is larger? I'm pretty sure the answer to this is obvious.

And if more scrapers donated millions to the site they scrape from, the world would be a much better place.

http://wikimediafoundation.org/wiki/Press_releases/Wikimedia...

5 comments

josefresco 4497 days ago

One man's "scraper" is another man's "aggregator".

How do you think Google would view my site if I wrapped Wikipedia's content, with a backlink, and ran my own ads alongside that content? I would imagine not very positively.

Also, is it okay that a bigger entity scrapes my content just because they send me traffic? You might not want to bite the hand that feeds you, but it still doesn't make it right.

ds9 4497 days ago

Google does not reproduce whole articles, only short excerpts to help searchers decide whether it's relevant to what they're looking for - and with clear indication of the source and in a context where it's understood that Google is showing the blurb only to point to the source where it was found.

This is technically scraping but it's hardly comparable to the bottom-feeders that plagiarize for money. (Edit: according to 'pud' on this page, Google uses a Wikipedia index so it's not scraping, but it is in the case of other sites that Google indexes.)

And yes, it's OK both legally and ethically if you do the same to Wikipedia - like Google that is, just for indexing purposes and not using whole articles.

Silhouette 4497 days ago

Google does not reproduce whole articles, only short excerpts to help searchers decide whether it's relevant to what they're looking for

And what about, for example, Google's image search tool, where the image itself might be what their user is searching for, and where Google controversially changed their system a little while ago to show full-size images in-SERP and de-emphasize forwarding search users to the original source? Or Google Cache, if it's reproducing material that has since been taken down deliberately from the original source?

To add insult to injury, some Google services still appear to rely on the original source's bandwidth to serve things like images (not to mention avoiding a certain legal argument about copyright infringement), thus violating a basic principle of netiquette, one that has been good manners ever since people actually used the word netiquette: you don't hotlink other people's stuff on your site.

josefresco 4497 days ago

You're comparing what Google does to another extreme when you say things like "bottom-feeders that plagiarize for money"

Surely you don't believe that all "scrapers" are bottom feeders? It's like saying every criminal is a murderer. There's a whole bunch of grey area in between, and this is where the criticism of Google's harsh penalties is valid.

AJ007 4497 days ago

You are at least partially incorrect.

Last year Google was testing reproducing entire Wikipedia articles within their mobile site. You could read the full article without going to Wikipedia (allowed by Creative Commons, of course). Between that and what they did with Google Images, I would say this reveals intention and is the direction web publishers should expect Google to be headed in.

In order for Google to continue to meet their growth targets they must shift a larger percentage of outgoing clicks from free to paid.

ezequiel-garzon 4497 days ago

The de facto standard robots.txt is pretty likely to be respected by Google, so it's fairly easy to stop their scraping your site. Yes, it is opt-out, but I'd expect it to be.

It may be quite frustrating for an upstart to be denied access while Google is explicitly allowed, but that's another matter.
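
For illustration, here is a minimal Python sketch of that opt-out, using the standard library's `urllib.robotparser`. The robots.txt contents, the example.com URLs, and the `UpstartBot` crawler name are all made up for the example; they are not taken from any real site. It also shows the "upstart" case: a file that explicitly admits Googlebot while shutting every other crawler out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot is only kept out of /private/,
# while every other crawler is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def can_fetch(user_agent: str, url: str) -> bool:
    """Return True if the rules above let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(can_fetch("Googlebot", "http://example.com/articles/1"))
    print(can_fetch("Googlebot", "http://example.com/private/x"))
    print(can_fetch("UpstartBot", "http://example.com/articles/1"))
```

Note that this only models a crawler that chooses to check the file; robots.txt is a convention, and nothing in it stops a scraper that ignores it.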

zone411 4497 days ago

Wikipedia's top rankings are actually a big problem. I know of a site that was the first to put up high-quality reference-type content on the Web and for a while was getting reasonable traffic from Google. Wikipedia's editors copied that content into thousands of articles in various ways: thousands with attribution or copying just the facts, and thousands without attribution and copying more than just the facts.

This original site is now getting so little traffic from Google that more people visit it from the trickle of these bottom-of-the-page Wikipedia links than from Google itself. Its traffic was also badly hurt by Google's Panda algorithm, which I think clearly proves how flawed it is since this algorithm was supposed to do the exact opposite.

Because of this situation, if somebody thinks of spending money to create high quality reference-type content, I would strongly advise against it. You have no chance vs. Wikipedia's poorly-written articles repurposing your content and Google's flawed algorithms.

lisper 4497 days ago

It seems a bit odd for you to be so cagey about the identity of this "original site" while at the same time lamenting that they aren't getting the traffic they deserve. Why don't you tell us who they are?

zone411 4497 days ago

It's because I don't speak for the owners of the site and I'd rather make sure they don't mind me putting it out there like this. I could let you know privately, if you'd like to check my story for yourself, though.

lisper 4496 days ago

Why on earth would they mind?

zone411 4496 days ago

I'm not sure if they do mind. I do know that their relationship with Google is important to them when it comes to their much larger and more successful projects and that this site has been mostly left behind, so they may not want to bring it up in the context of this Hacker News post, even in the unlikely case that it resulted in this site getting its traffic back. Why not just email me and I'll show you a simple content site with minimal traffic, not using any black or gray-hat SEO tactics, with high-quality, original (to the Web) content, referenced in thousands of Wikipedia articles and you can decide for yourself if my post was truthful.

dennisgorelik 4495 days ago

Could you get permission from the owners and publish it here?

danielbarla 4497 days ago

That's one way of looking at it; on the other hand, they link to the original URL, passing traffic back to the original source. Most "scraper" sites take the content, wrap it in their own similar outer layer, and try to take ad revenue. E.g. I've seen my own StackOverflow answers copied, word for word, to a scraper site and presented under a made-up name.

dangrossman 4497 days ago

StackOverflow actually allows this; all their data is Creative Commons licensed, and they publish the full database dump on the Internet Archive.

https://archive.org/details/stackexchange

jbinto 4497 days ago

Do the terms of the license allow for this kind of abuse?

Just because something is CC doesn't mean you can do whatever you want with it.

dangrossman 4497 days ago

Yes, they do; it's not abuse when you're given explicit permission. CC BY-SA means you can do whatever you want with it as long as you attribute the source as specified.

leephillips 4497 days ago

"as long as you attribute the source"

danielbarla said that they presented the material under a false name; this goes beyond copying and becomes plagiarism, which I can't imagine is an intended result of the CC license.

aroch 4497 days ago

Is the source 'User X' or 'StackOverflow'? When you reference CC BY-SA code you don't reference the people who, say, checked it into git but rather the whole repo.

Flimm 4497 days ago

CC BY-SA is short for Creative Commons Attribution Share-Alike. BY means you must attribute, and SA means you must license any distributed derivative works under the same license (copyleft). Attribution on its own is not enough.

grey-area 4497 days ago

No, attribution is required.

jliptzin 4497 days ago

Interesting, from the file sizes you can quickly gauge the relative popularity of each subject.

tobehonest 4497 days ago

By having a tl;dr about the actual Wikipedia page, there is no need for the user to click on the link. Following what you're saying, Google has wrapped it in their own layer, and is trying to take ad revenue.

smoyer 4497 days ago

Actually, I find that having a tl;dr will rarely answer the question(s) I have on a topic, but it will commonly show me whether I've found the right wikipedia page. I usually either click-through or refine my search.

bushido 4497 days ago

They don't actually link to the wikipedia URL. They mask a link that leads to another Google page "/url?sa=t&rct=j&q=&...." which in turn responds with a 200 OK page that redirects to Wikipedia.

Sure, it passes the keywords etc. But this likely reduces the number of people visiting Wikipedia, while increasing Google's ad revenues. If anyone but Google did this they'd be potentially blacklisted by Google.

VikingCoder 4497 days ago

Actually, they do link to the wikipedia URL.

href="http://en.wikipedia.org/wiki/Scraper_site" appears directly in the source code of that web page.

It also has an onmousedown handler that rewrites the URL to point at Google, so they can tell which link you clicked, to improve their ranking system. And Google works very closely with sites to make sure the sites know how to understand the referrals.
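
To make the mechanism concrete, here is a rough Python sketch of what a click-tracking redirect URL looks like in general. It is not Google's actual scheme: the `tracker.example` host and the `dest` query parameter are hypothetical names picked for the example. The idea is just that the real destination travels inside the tracking URL, so the tracker can log the click and then send the browser on to the original page.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical tracking endpoint; not a real Google URL.
TRACKER = "http://tracker.example/url"


def wrap(dest: str) -> str:
    """Build a tracking URL that carries the real destination as a parameter."""
    return f"{TRACKER}?{urlencode({'dest': dest})}"


def unwrap(tracked: str) -> str:
    """Recover the real destination, as the tracker would before redirecting."""
    query = parse_qs(urlsplit(tracked).query)
    return query["dest"][0]


if __name__ == "__main__":
    original = "http://en.wikipedia.org/wiki/Scraper_site"
    tracked = wrap(original)
    print(tracked)
    print(unwrap(tracked))
```

In the page itself the href would stay as the original URL and only be swapped for the wrapped one when the link is clicked, which is why the Wikipedia address is what shows up in the source.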

jpeterson 4497 days ago

Relax, it's a joke.

MWil 4497 days ago

If Google only needs to visit Wikipedia's "scraper" page once a day or less, but serves it out to others with attribution, isn't that helping Wikipedia by lowering traffic COSTS?

baldfat 4497 days ago

BUT it gives full attribution to http://en.wikipedia.org/wiki/Scraper_site???

DanBC 4497 days ago

That is what the parent comment says.