| HN Mirror


by dredmorbius 1749 days ago

Virtually all the works are published as PDFs. (There are some other formats, occasionally DJVU, etc.) There's integrated compression, though this can still vary tremendously by document.

Recent publications are virtually always based on direct PDF renders, and tend to be a few hundred kB per article.

Older publications are often scanned from paper-based copies, and can be about 10-20x larger, depending on the source. These may or may not have OCRed text, and OCR itself may be of variable quality. For documents with images or diagrams, those also add to both size and difficulty in vectorising copies.
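
As a back-of-the-envelope illustration of that size gap, here's a minimal sketch. The 300 kB per rendered article and the one-million-document corpus are assumed figures picked purely for illustration, not measurements; only the 10-20x multiplier comes from the estimate above. The function names are likewise made up for this example.

```python
def scan_size_range_kb(rendered_kb, low_factor=10, high_factor=20):
    """Size range (kB) of a scanned copy, given the rendered size.

    The 10x and 20x defaults are the rough multipliers quoted above.
    """
    return rendered_kb * low_factor, rendered_kb * high_factor


def corpus_size_gb(n_docs, per_doc_kb):
    """Total corpus size in GB (decimal units: 1 GB = 1,000,000 kB)."""
    return n_docs * per_doc_kb / 1_000_000


# Assumed, illustrative inputs (not measured values).
rendered_kb = 300      # "a few hundred kB" per rendered article
n_docs = 1_000_000     # hypothetical corpus size

low_kb, high_kb = scan_size_range_kb(rendered_kb)
print(f"rendered corpus: {corpus_size_gb(n_docs, rendered_kb):.0f} GB")
print(f"scanned corpus:  {corpus_size_gb(n_docs, low_kb):.0f}"
      f" to {corpus_size_gb(n_docs, high_kb):.0f} GB")
```

Under those assumptions the same corpus goes from hundreds of GB as renders to several TB as scans, which is the gap at stake when weighing faithful scans against reprocessed renders.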

It's possible to go through larger scans and regenerate them as rendered PDFs. That's labour-intensive and error-prone. There's also a range of viewpoints in archival practice as to whether it's preferable to retain the full expression of the original published version (and often the accumulated marginalia and other marks of a specific instance), or to optimise for both storage and automated processing through reprocessed renders. The costs of reprocessing are high (typically you'll require a human or multiple humans to proof each work), though the storage and line-transmission savings are considerable.

I lean toward the latter myself. The attitude of other archivists (notably the Internet Archive) is to capture as faithful a replication of the originally published formats as possible, at considerable cost in both storage and accessibility. (This applies to the Archive's work in print, online / Web, and other document formats.)

Pressed, I'd strongly recommend a "capture what you can, reprocess according to need and demand as possible" approach.