Hacker News
by sixtyfourbits 1750 days ago
It's all PDF files, which have their own compression, so it's unlikely there would be substantial gain from additional compression. Each torrent has 100 zip files, and each zip file has 1000 PDFs, but the files are stored uncompressed within the zips (i.e. using the STORE method).
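If you have one of the zips locally, a quick way to confirm the STORE claim is to look at each entry's compression method. This is only a sketch using Python's standard `zipfile` module. The helper names `stored_fraction` and `build_demo_zip` are made up for illustration, and the demo archive is a tiny in-memory stand-in, not one of the real zips.

```python
import io
import zipfile


def stored_fraction(zip_source):
    """Return the fraction of entries in a zip that use the STORE method."""
    with zipfile.ZipFile(zip_source) as zf:
        infos = zf.infolist()
        if not infos:
            return 0.0
        stored = sum(1 for info in infos if info.compress_type == zipfile.ZIP_STORED)
        return stored / len(infos)


def build_demo_zip():
    """Build a small in-memory zip with STORE entries (hypothetical stand-in data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for i in range(3):
            zf.writestr(f"paper_{i}.pdf", b"%PDF-1.4 placeholder bytes")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # For a real archive, pass its path instead of the demo buffer.
    print(stored_fraction(build_demo_zip()))
```

For a real archive you would pass the path to the zip file; `zipfile.ZipFile` accepts either a path or a file-like object.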
3 comments

> It's all PDF files, which have their own compression, so it's unlikely there would be substantial gain from additional compression.

You could write a custom compressor that decompiles journal PDFs to valid TeX, then compresses that.

Or at the simpler end of what's technologically possible, you could at least extract shared assets such as fonts that appear in multiple files. Keep files from the same journal together to find more overlaps.

I suspect there's quite a large gain to be had from further compression, at least theoretically. Even more if you could accept some level of non-semantic loss.

You could losslessly translate PDFs with compression to PDFs with no compression (bitmap images excepted), tar them up and compress the lot. This would get you a fair bit of gain for little pain.
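The "tar them up and compress the lot" part can be illustrated with the standard library alone. The sketch below does not decode any PDF streams. It assumes that step has already happened and uses synthetic byte strings as stand-ins: each fake document is a shared 20 kB blob (think of an embedded font that appears in every file) plus 2 kB of unique data. All names (`make_fake_documents`, `tar_bytes`, `compare_sizes`) are hypothetical.

```python
import io
import lzma
import random
import tarfile


def make_fake_documents(count=20, seed=0):
    """Synthetic stand-ins for uncompressed PDFs: a shared blob plus unique bytes."""
    rng = random.Random(seed)
    shared = bytes(rng.randrange(256) for _ in range(20_000))
    docs = {}
    for i in range(count):
        unique = bytes(rng.randrange(256) for _ in range(2_000))
        docs[f"paper_{i}.pdf"] = shared + unique
    return docs


def tar_bytes(docs):
    """Pack the documents into an uncompressed tar archive held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in docs.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def compare_sizes(docs):
    """Return (total size compressing each file alone, size compressing the tar)."""
    individual = sum(len(lzma.compress(data)) for data in docs.values())
    solid = len(lzma.compress(tar_bytes(docs)))
    return individual, solid


if __name__ == "__main__":
    individual, solid = compare_sizes(make_fake_documents())
    print(f"individual: {individual} bytes, solid: {solid} bytes")
```

The solid archive wins here because the compressor sees the shared blob once and refers back to it for every later file. How much real PDFs would gain depends on how much they actually share, which this toy data says nothing about.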

However, I guess they use .zip STORE because it's fairly robust against minor corruption.

Is there some kind of searchable index included so that you can locate an article in a particular zip? I'm assuming each article has some kind of ID number and the zips are divided by ID range or something?
Yep! See https://github.com/sci-hub-p2p/artifacts/releases/tag/0

This project is in its early stages and the documentation has quite some way to go, but the index that's part of the release contains all the necessary information. The tool also contains the code needed to produce the index files if you have a local copy of the zips.

Each torrent contains 100,000 files, made up of 100 zip files with 1,000 PDFs each. The PDFs are named by DOI. There's a database dump at http://libgen.rs/dbdumps/ (scimag.sql.gz) which has the id -> DOI mapping and other information. The specific torrent and zip file can be determined from the id using integer division: torrent = id/100000 and zip = id/1000.
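As a rough sketch of that arithmetic, assuming ids are non-negative integers and the divisions are meant as integer (floor) division. The function name `locate` and the extra "zip within torrent" value are my own additions for illustration, and the actual torrent and zip file names are not derived here.

```python
FILES_PER_ZIP = 1_000
ZIPS_PER_TORRENT = 100
FILES_PER_TORRENT = FILES_PER_ZIP * ZIPS_PER_TORRENT  # 100,000


def locate(article_id):
    """Map a database id to (torrent index, global zip index, zip index within torrent)."""
    if article_id < 0:
        raise ValueError("article_id must be non-negative")
    torrent = article_id // FILES_PER_TORRENT
    zip_index = article_id // FILES_PER_ZIP
    zip_in_torrent = zip_index % ZIPS_PER_TORRENT  # assumption: 100 consecutive zips per torrent
    return torrent, zip_index, zip_in_torrent


if __name__ == "__main__":
    print(locate(12_345_678))  # (123, 12345, 45)
```

So id 12,345,678 would land in torrent 123, zip 12345, which is the 46th zip (index 45) of that torrent under these assumptions.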

Sci-Hub database/index is available here: http://libgen.rs/dbdumps/scimag.sql.gz

and database documentation is available here: https://gitlab.com/lucidhack/knowl/-/wikis/References/Libgen...

also see introduction to Sci-Hub for developers: https://www.reddit.com/r/scihub/comments/nh5dbu/a_brief_intr...

But each PDF is compressed individually. The textual content of the papers must have a lot of redundancy between them, so maybe there is some gain to be had there?
Illustrations easily outweigh the textual content, and those aren't shared. The text, formatting and LaTeX code for an article compress to something like 10 kB, so there's not much to save there.