HN Mirror


by apo 2532 days ago

> Over the past year, Malamud has — without asking publishers — teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day.

Maybe I missed it, but the article doesn't seem to explain exactly how Malamud's group compiled its database.

Throttling is a major problem with the naive approach of pointing wget at a publisher's site. The publisher detects a bot on its network downloading everything in sight and either slows data transfer to a trickle or cuts off access altogether.
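
To make the detection side concrete, here is a rough sketch of how a server might classify a client by request rate. It is an illustration only: the class name, the thresholds, and the three-way allow/throttle/block policy are all assumptions, not anything a particular publisher is known to run.

```python
import time
from collections import defaultdict, deque


class SlidingWindowThrottle:
    """Hypothetical server-side policy: classify each request per client.

    All thresholds are made-up defaults for illustration.
    """

    def __init__(self, window_s=60.0, slow_after=30, block_after=120):
        self.window_s = window_s        # length of the look-back window
        self.slow_after = slow_after    # more hits than this: slow the client
        self.block_after = block_after  # more hits than this: refuse service
        self._hits = defaultdict(deque)  # client id -> recent request times

    def check(self, client_id, now=None):
        """Record one request and return 'allow', 'throttle', or 'block'."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.block_after:
            return "block"
        if len(hits) > self.slow_after:
            return "throttle"
        return "allow"
```

A client that sends a burst of requests crosses the first threshold and gets slowed, then crosses the second and gets cut off; once it goes quiet for a full window, its count drops and it is allowed again. Real deployments likely key on more than request count (user agent, access patterns, IP ranges), which this sketch ignores.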

The publishers may not win on copyright, but they may try to build a criminal case if Malamud's team actively took steps to circumvent throttling and defeat the hosting sites' defenses, especially if the publisher knows what to look for in its logs.

1 comment

toomuchtodo 2532 days ago

Sci-Hub? Bonus points if your properly executed scraping project backfills Sci-Hub where it is missing DOIs.
