Hacker News
by covertlibrarian 2337 days ago
For those who support sci-hub's mission and would like to help ensure it can never be taken away, you may be interested in the following project to lay the groundwork necessary for a widely-replicated, decentralized version of the repository:

https://www.reddit.com/r/DataHoarder/comments/ed9byj/library...

All of the documents accessed through sci-hub are archived by the library genesis project and made available as torrents. Currently there are just over 80 million articles included in these torrents. The total size of the archive is around 70TB. The link above also refers to the library genesis books collection, which is 33TB.

This effort has seen tremendous interest in recent weeks; the books collection (libgen) is now widely replicated, but around a third of the scimag articles collection (i.e. those from sci-hub) has only 1-2 reliable seeders and needs more before it can be considered safely backed up. If you have the resources available, I would encourage you to consider assisting. Previous discussion of this project is at https://news.ycombinator.com/item?id=21692222

Looking forward, the next step is to find a suitable way of providing accessible, truly decentralized access, without relying on a single point of failure (i.e. a web interface). Some have been exploring IPFS as a potential mechanism, but there are many ways this could be done. This is a challenging but important problem to solve; with the data available, there is now an opportunity for developers to address the access issues. There may come a day when the sci-hub website goes offline, and it would be good if a fallback is already in place at that point.

Aaron Swartz may no longer be with us, but his spirit lives on. It's now up to us to carry on the fight for which he paid so dearly.

5 comments

It's still extremely raw, but I've done some work on decentralized access to sci-hub. Given an index of scihub ID <> DOI which can be generated from the DB dump made available by libgen https://github.com/frrad/scimag it's possible to selectively download only the parts of the torrent required to get the article you are looking for.

Check it out, contributions welcome. https://github.com/frrad/skyhub
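To make the "selectively download only the parts of the torrent" idea concrete, here is a minimal sketch of the piece arithmetic involved. It is not taken from skyhub. It only assumes the standard BitTorrent (v1) layout, in which pieces are cut from the concatenation of the torrent's files in their listed order. The function name and the example file sizes are hypothetical, and the lookup from a DOI to a file inside a torrent is assumed to come from the index mentioned above.

```python
from typing import List, Tuple


def pieces_for_file(file_lengths: List[int], file_index: int,
                    piece_length: int) -> Tuple[int, int]:
    """Return (first_piece, last_piece), inclusive, covering one file.

    file_lengths: sizes in bytes of the files, in torrent order.
    file_index:   position of the wanted file in that list.
    piece_length: the torrent's piece size in bytes.
    """
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    if not 0 <= file_index < len(file_lengths):
        raise IndexError("file_index out of range")
    if file_lengths[file_index] == 0:
        raise ValueError("an empty file occupies no pieces")

    # Byte offset of the file within the concatenated payload.
    start = sum(file_lengths[:file_index])
    # Offset of the file's last byte.
    end = start + file_lengths[file_index] - 1
    return start // piece_length, end // piece_length


if __name__ == "__main__":
    # Hypothetical torrent: three files, 1024-byte pieces.
    sizes = [1000, 2500, 700]
    for i in range(len(sizes)):
        print(i, pieces_for_file(sizes, i, 1024))
```

A client would then request only that piece range. Note that the first and last pieces are usually shared with neighbouring files, so slightly more than the file itself gets downloaded.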

Except everything after "Looking forward," is still an open research problem.

> Some have been exploring IPFS as a potential mechanism

IPFS has no anonymity, so individual nodes could be attacked the way individual torrenters got sued back in the day.

> but there are many ways this could be done

There are currently no scalable ways this could be done. If there were, you wouldn't have a single bold researcher providing a single-point-of-failure search engine that hops to some other location each time authorities shut it down.

There is nothing that stops nodes from hosting IPFS content from behind a VPN.
Then you're just moving the point of failure to the VPN.
I wish I could help, but I cannot afford 70TB or more, even for myself.
Then this comment isn't for you, but rather other HN readers.

If you believe in the importance of science, and if you're a software engineer at FAAAANG, then you can afford a 75+45 TB (scihub+libgen) hard drive array. If you are making $300,000/year then a $5-10k hobby project to store distilled human progress is something that you could make financially possible for yourself.

Consider doing this, because this might be the last opportunity to get a relatively complete copy. Just having a copy and letting it sit for 10 or 20 years can be hugely valuable to the world, let alone your community.

And of course you can partner with some other like-minded folks.
For sourcing drives, it's probably better to buy external 10 TB drives (and shuck them). Make a JBOD... I dunno.

A quick eBay search right now shows a used LTO-8 drive for $3K (same as new), LTO-7 for $1.8K, LTO-6 for $0.5K, LTO-5 for $0.15K. If you shop around, you can find much better deals.

Here are some tape costs:

    LTO-5: 1.5TB for $19.60 = $13.07/TB
    LTO-6: 2.5TB for $22.58 = $9.03/TB
    LTO-7: 6TB for $57.95 = $9.66/TB
    LTO-7 type M: 9TB for $57.95 = $6.44/TB
    LTO-8: 12TB for $134.25 = $11.19/TB

Breakevens:

    LTO8=300T
    LTO7=300T
    LTO6=75T
    LTO5=50T
    LTO4 and below=never
Of course without a tape robot, no one should be using LTO5--there's some personal inconvenience cutoff for everyone.
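For anyone who wants to redo the breakeven arithmetic with their own prices, here is a rough sketch. It assumes the breakeven is the capacity at which tape drive plus tapes costs the same as plain hard drives. The comment above does not state which hard-drive price it used, so the $16/TB figure below is an assumption (roughly a 10TB drive at $160), and the results depend heavily on it and will not necessarily match the numbers quoted above. The drive and tape prices are the ones listed in this comment.

```python
import math


def breakeven_tb(drive_cost: float, tape_cost_per_tb: float,
                 hdd_cost_per_tb: float) -> float:
    """Capacity (TB) at which tape becomes cheaper than hard drives.

    Solves drive_cost + tape_cost_per_tb * T = hdd_cost_per_tb * T for T.
    Returns math.inf if the tapes are not cheaper per TB than hard drives.
    """
    saving_per_tb = hdd_cost_per_tb - tape_cost_per_tb
    if saving_per_tb <= 0:
        return math.inf
    return drive_cost / saving_per_tb


# ASSUMPTION: hard-drive price per TB; not stated in the comment above.
HDD_COST_PER_TB = 16.0

# (tape drive price in $, tape media price in $/TB), from the comment above.
GENERATIONS = {
    "LTO-5": (150.0, 13.07),
    "LTO-6": (500.0, 9.03),
    "LTO-7": (1800.0, 9.66),
    "LTO-8": (3000.0, 11.19),
}

if __name__ == "__main__":
    for name, (drive, media) in GENERATIONS.items():
        t = breakeven_tb(drive, media, HDD_COST_PER_TB)
        print(f"{name}: breakeven at about {t:.0f} TB")
```

Changing `HDD_COST_PER_TB` by even a few dollars moves the breakevens a lot, since the denominator is the small per-TB difference between tape and disk.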
10TB drives frequently go on sale for ~$160. It may not even cost $10k.
the data is divided up into torrents of a few GB to some dozens of GB. pick a few that you find numerologically interesting (i download ones with my birth year in their number) and back those up.
What is the legal situation on this?

I'd love to seed. I even have the space for quite a lot, but seeding illegal content in Germany is a big no-no.

I think this project is an exercise in civil disobedience.
> but seeding illegal content

the content isn't illegal, the act of copying it might be

An important distinction frequently blurred by copyright maximalists. Thanks for pointing it out.
> Aaron Swartz may no longer be with us, but his spirit lives on. It's now up to us to carry on the fight for which he paid so dearly.

Has anyone proposed naming an "information making free" service after him yet?

There's the "Aaron Swartz Day" which both Internet Archive and EFF celebrate.

Then there's Star Wars, the Swartz always being with you, and such (a joke also made with Jonathan Schwartz). I quite like that saying, but some may consider it cheap or offensive.

I like that idea, but I'd like to reserve it for something that actually makes an impact rather than for someone who just starts using Aaron's name. Also, I think you should always contact the family first to ask for permission.