

by shakna 3013 days ago

I don't know how they are doing it, but Google Scholar does not have an API, and scraping is against their TOS.

> Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide.

Despite this, there is scholar.py [0], which can query Google Scholar and extract publication data from the results, though it explicitly doesn't work around the rate limits.

[0] https://github.com/ckreibich/scholar.py

2 comments

userbinator 3013 days ago

> or try to access them using a method other than the interface

Unless this actually exploits something and hacks into Google's servers to get to the content, which would be something quite different, it wouldn't really be distinguishable from someone manually visiting the site in a browser, volume aside.

IMHO the pervasive attitude today of somehow requiring permission or an explicitly sanctioned "API" to access what is otherwise publicly accessible data is rather troubling for the freedom and flexibility of the Web as a whole. It encourages walled-garden content models and centralisation.


shakna 3013 days ago

I absolutely agree. In my view, if something is publicly accessible, the public should be able to use it as they see fit. (An HTTP response has already authorised you to copy the data to a machine. How can it be bound by a TOS that you need to access the original page to find?)

However, Google doesn't agree and the current court precedent doesn't either. So I tried to address the parent's concern from that viewpoint.


TeMPOraL 3013 days ago

Yup. I don't believe web hosts should be entitled to that much control.

My browser is my User Agent. The way it renders or interprets the data is my business.


crispyporkbites 3013 days ago

HTTP is an interface with implicit instructions (especially if RESTful), provided by Google.
