by enan 3929 days ago
This is my attempt to solve the problem of sites becoming unavailable when they make it to the HN front page. It is also a trial balloon for a new service that I am developing. AMA! :)

If you're mirroring articles on these sites without permission, wouldn't that be a violation of the owners' copyrights?
That's a valid concern. I believe this service is very similar to Google cache and the copying should be permissible under fair use [1]. But IANAL :)

[1] http://fairuse.stanford.edu/case/field-v-google-inc/

edit: we also respect robots.txt!

I looked at the judge's order in that case, which was very interesting. Some of the points he makes in Google's favor are:

- Google respects any "noarchive" tags that are on the page, so the page owner can control whether Google copies each page.

- The site owner can also prevent Google from copying the entire site (or parts of it) via robots.txt.

If I understand the argument correctly, this metadata, as set on the plaintiff's site, gave Google an implied license to use the content, based on widely-understood web conventions.
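For what it's worth, the two opt-out checks listed above could be sketched roughly like this. This is only an illustration using the Python standard library, not how Google or this service actually does it; the function name `may_archive` and the user agent string `example-mirror-bot` are made up for the example.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def may_archive(url, html, robots_txt, user_agent="example-mirror-bot"):
    """Return True only if robots.txt allows the URL and the page has no noarchive tag.

    `user_agent` is a hypothetical crawler name chosen for this sketch.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        return False  # the site owner excluded this path via robots.txt

    meta = RobotsMetaParser()
    meta.feed(html)
    return "noarchive" not in meta.directives  # per-page opt-out
```

A mirror would call something like `may_archive(url, page_html, robots_txt)` before storing a copy, and skip the page whenever it returns `False`.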

Also, the order notes that Google places a prominent banner on top of its cached pages stating that they're copies that may not be current. However, your copies seem to be indistinguishable from the original content. If somebody were to send someone else a link to one of your cached articles, it would be difficult to tell that it was a cached copy.

Thanks for the comments. Our crawler does respect the robots.txt standard and the nofollow tag. It seems noarchive is what Google recommends, so we will look into supporting it.

We do put a banner on the index page, but not on each cached page. Thanks for pointing it out - will fix!

Even more important than that for me (and possibly for you too) is making sure that none of these pages make it into Google's index.

The duplication of content (potentially sending the original pages down in search ranks) and the fact that you are polluting the organic search results for the sites you mirror could be a big issue for the owners of the pages.

Good point! There is a robots.txt that prevents the site from getting indexed now: http://hn.getpageback.com/robots.txt
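If anyone wants to check a rule set like this, here is a quick sketch with Python's standard `urllib.robotparser`. The `BLOCK_ALL` rules below are an assumption about what a block-everything robots.txt typically looks like, not a copy of the file at the URL above, and `is_blocked` is a helper name invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of a robots.txt that asks every crawler to stay out.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def is_blocked(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt rules forbid user_agent from fetching url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return not rules.can_fetch(user_agent, url)
```

With rules like these, `is_blocked(BLOCK_ALL, "http://hn.getpageback.com/some/page")` should come back `True` for any path on the mirror.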