| HN Mirror

	by mishafb 1750 days ago
	Is it compressible or already compressed?

sixtyfourbits 1750 days ago

It's all PDF files, which have their own compression, so it's unlikely there would be substantial gain from additional compression. Each torrent has 100 zip files, and each zip file has 1000 PDFs, but the files are stored uncompressed within the zips (i.e. using the STORE method).
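
For anyone who wants to confirm the STORE claim on a local copy, here is a minimal sketch using Python's standard `zipfile` module. The function name `compression_methods` and the in-memory demo archive are illustrative assumptions, not part of any Sci-Hub tooling.

```python
import io
import zipfile

def compression_methods(zip_file):
    """Map each member name to True if it is stored uncompressed (STORE)."""
    with zipfile.ZipFile(zip_file) as zf:
        return {info.filename: info.compress_type == zipfile.ZIP_STORED
                for info in zf.infolist()}

# Demo on a tiny in-memory archive (a stand-in for one of the real zips).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("example.pdf", b"%PDF-1.4 placeholder")
buf.seek(0)
print(compression_methods(buf))  # {'example.pdf': True}
```

Passing a path to a real zip instead of `buf` should work the same way, since `zipfile.ZipFile` accepts either a path or a file-like object.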

dmurray 1750 days ago

> It's all PDF files, which have their own compression, so it's unlikely there would be substantial gain from additional compression.

You could write a custom compressor that decompiles journal PDFs to valid TeX, then compresses that.

Or at the simpler end of what's technologically possible, you could at least extract shared assets such as fonts that appear in multiple files. Keep files from the same journal together to find more overlaps.
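
The shared-asset idea amounts to content-addressed deduplication: store each distinct blob once and keep a per-file list of references. A rough sketch, assuming the assets have already been pulled out of each PDF by some parser (not shown); `dedupe_assets` and the sample data are hypothetical.

```python
import hashlib

def dedupe_assets(files):
    """files: {filename: [asset bytes, ...]}.

    Returns (store, manifest): store maps a SHA-256 hex digest to the blob,
    manifest maps each filename to its ordered list of digests.
    """
    store = {}
    manifest = {}
    for name, assets in files.items():
        keys = []
        for blob in assets:
            key = hashlib.sha256(blob).hexdigest()
            store.setdefault(key, blob)  # keep only the first copy of each blob
            keys.append(key)
        manifest[name] = keys
    return store, manifest

# Hypothetical example: two papers from the same journal embed the same font.
font = b"FONTDATA" * 100
files = {"a.pdf": [font, b"text of paper A"],
         "b.pdf": [font, b"text of paper B"]}
store, manifest = dedupe_assets(files)
print(len(store))  # 3
```

Each file can be rebuilt from `manifest` and `store`, so the scheme is lossless at the asset level; whether the rebuilt PDF is byte-identical depends on the extraction step.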

I suspect there's quite a large gain to be had from further compression, at least theoretically. Even more if you could accept some level of non-semantic loss.

ectopod 1750 days ago

You could losslessly translate PDFs with compression to PDFs with no compression (bitmap images excepted), tar them up and compress the lot. This would get you a fair bit of gain for little pain.
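
The gain here comes from compressing many files as one stream, so content shared between files is encoded only once. A toy sketch with `zlib` and synthetic data standing in for decompressed PDFs; the sizes and the shared block are made up for illustration.

```python
import random
import zlib

def individual_size(docs):
    """Total size when every document is compressed on its own."""
    return sum(len(zlib.compress(d, 9)) for d in docs)

def solid_size(docs):
    """Size when all documents are concatenated and compressed together."""
    return len(zlib.compress(b"".join(docs), 9))

# Synthetic stand-in: an incompressible 2000-byte block shared by every
# document (think of an embedded font) plus a short unique body.
rng = random.Random(0)
shared = bytes(rng.getrandbits(8) for _ in range(2000))
docs = [shared + f"unique body of paper {i}".encode() for i in range(20)]

print(individual_size(docs), solid_size(docs))
```

Note that DEFLATE only looks back 32 KiB, so with real multi-hundred-kB files a compressor with a larger window would be needed to see cross-file savings.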

However, I guess they use .zip STORE because it's fairly robust against minor corruption.

Thorentis 1750 days ago

Is there some kind of searchable index included so that you can locate an article in a particular zip? I'm assuming each article has some kind of ID number and the zips are divided by ID range or something?

sixtyfourbits 1750 days ago

Yep! See https://github.com/sci-hub-p2p/artifacts/releases/tag/0

This project is in its early stages and the documentation has quite some way to go, but the index that's part of the release contains all the necessary information. This tool also contains the code necessary to produce the index files if you have a local copy of the zips.

Each torrent contains 100,000 files, made up of 100 zip files with 1,000 PDFs each. They are named by DOI. There's a database dump at http://libgen.rs/dbdumps/ (scimag.sql.gz) which has the id -> DOI mapping and other information. The specific torrent and zip file can be determined from the id using integer division: torrent = id/100000 and zip = id/1000.
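
As a quick sketch of that arithmetic, assuming the divisions are integer (floor) divisions and that ids are non-negative; the function name `locate` is just for illustration and is not taken from the linked tool.

```python
def locate(article_id):
    """Return (torrent number, zip number) for a scimag id."""
    torrent = article_id // 100_000  # 100,000 files per torrent
    zip_number = article_id // 1_000  # 1,000 PDFs per zip
    return torrent, zip_number

print(locate(12_345_678))  # (123, 12345)
```

The id itself still has to be looked up from the DOI via the database dump; this only covers the last step.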

andyxor 1750 days ago

Sci-Hub database/index is available here: http://libgen.rs/dbdumps/scimag.sql.gz

and database documentation is available here: https://gitlab.com/lucidhack/knowl/-/wikis/References/Libgen...

also see introduction to Sci-Hub for developers: https://www.reddit.com/r/scihub/comments/nh5dbu/a_brief_intr...

jobigoud 1750 days ago

But each PDF is compressed individually. The textual content of the papers must have a lot of redundancy between them, so maybe there is some gain to be had there?

PeterisP 1750 days ago

Illustrations easily outweigh the textual content, and those aren't shared. I mean, the text/formatting/LaTeX code for an article compresses to something like 10 kB, so there's not much to save there.

dredmorbius 1749 days ago

Virtually all the works are published as PDFs. (There are some other formats, occasionally DJVU, etc.) There's integrated compression, though this can still vary tremendously by document.

Recent publications are virtually always based on direct PDF renders, and tend to be a few hundred kB per article.

Older publications are often scanned from paper-based copies, and can be about 10-20x larger, depending on the source. These may or may not have OCRed text, and OCR itself may be of variable quality. For documents with images or diagrams, those also add to both size and difficulty in vectorising copies.

It's possible to go through larger scans and regenerate them as rendered PDFs. That's intensive and error prone. There's also a range of viewpoints on archival as to whether it's preferable to retain the full expression of the original published version (and often accumulated marginalia and other marks of a specific instance), or to optimise for both storage and automated processing through reprocessed renders. The costs are high (typically you'll require a human or multiple humans to proof each work), though the storage and line-transmission savings are considerable.

I lean toward the latter myself. The attitude of other archivists (notably the Internet Archive) is to capture as faithful a replication of originally-published formats as possible, at considerable cost in both storage and accessibility. (This applies to the Archive's work in print, online / Web, and other document formats.)

Pressed, I'd strongly recommend a "capture what you can, reprocess according to need and demand as possible" approach.
