HN Mirror

by squigz 711 days ago
You dropped a 0. Anna's Archive is currently 862.4 TB.

3 comments

bityard 711 days ago

That is true. However, it also has a staggering amount of duplicate data. I have _heard_ that if you search for almost any particular book, you often get a dozen results of varying sizes and quality, even for the same filetype. It's a hard problem to solve, but if we had something that could somehow pick the "best" copy of a particular title, for every title in the library, Anna could likely drop the zero herself.

unaindz 708 days ago

As one of their blog posts explains, that's by design: they download all versions of any file. The reasoning was that some lower-quality video files will have subtitles or better audio than the high-quality video.

Some filtering may be possible to automate, but lots of the tasks involved will have to be manual, like merging video and audio from different sources or syncing subtitles from another file.

squigz 710 days ago

The above number is excluding duplicates.

RachelF 711 days ago

Yes, too much for one person, but collectively it is possible to keep it alive.

If anyone wishes to help, you can generate a chunk in 1TB units and seed via BitTorrent here:

https://annas-archive.gs/torrents

CamperBob2 710 days ago

Honestly, if I can't have the whole thing, I'm not going to bother mirroring a 1TB fragment that's worthless by itself to everybody except copyright attorneys.

As ndriscoll points out, the only feasible way to distribute an archive of this size is with physical hard drives. I sure wish they would find a reasonably-trustworthy way to offer that.

MrDrMcCoy 708 days ago

Most of the books are bloated PDFs. I'm slowly working on a project to reliably convert PDF to DjVu, which on average yields a highly readable document that's 33% of the original size on disk. The project is proving difficult, as the tooling for DjVu is quite moldy now, and the output often needs to be manually reviewed to ensure the file remains readable. pdf2djvu exists, but it's highly unreliable, and thus can't be used in bulk. Other ebook formats are XML-based and tend to be similarly bloated due to the overhead of the markup. It's a hard problem with so little in the way of good file format choices.

CamperBob2 707 days ago

That sounds like a pretty terrible idea, TBH. All of the best tooling is for PDFs, as you note, and storage will only get cheaper.

Ultimately that content is going to need to be represented as raw UTF-8 text and encoded images, so I don't see much upside to migrating it from one intermediate lossy file format to another.

squigz 710 days ago

You are never going to have a physical copy of the archive. It's nearly a petabyte in size.

ranger_danger 710 days ago

I know several datahoarders who have at least 1 PB. Also, archive.org grows by that much at least every day.

squigz 710 days ago

I assumed that GP was an average person who doesn't have a storage array sitting at home. I'm not really sure why the IA is relevant here.

CamperBob2 710 days ago

1 PB of disk space would cost about $10K at this point in time. Not exactly unattainable. Looks like it would fit in a volume of space about the size of a standard refrigerator.

I'd be OK with both requirements.
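
For anyone who wants to sanity-check the $10K estimate, here is a rough sketch of the arithmetic. It borrows the 22 TB drive size and the roughly $230 used price quoted elsewhere in this thread; those are assumptions, not figures from this comment, and the function names are made up for illustration. It counts raw drives only, with no redundancy, enclosures, or server.

```python
import math

def drives_needed(total_tb: float, drive_tb: float) -> int:
    """Number of whole drives required to hold total_tb of raw data."""
    return math.ceil(total_tb / drive_tb)

def raw_cost(total_tb: float, drive_tb: float, price_per_drive: float) -> float:
    """Cost of the bare drives only (no parity, chassis, or controllers)."""
    return drives_needed(total_tb, drive_tb) * price_per_drive

# Assumed inputs: 1 PB = 1000 TB, 22 TB drives at about $230 each.
print(drives_needed(1000, 22))
print(raw_cost(1000, 22, 230))
```

Under those assumptions the drive bill lands in the same ballpark as $10K; adding parity drives or buying new instead of used would push it higher.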

shrubble 711 days ago

If you only care about non-fiction and science journals, it is more like 250 TB, I think? Still several thousand dollars in 22 TB drives with ZFS, though.

ndriscoll 710 days ago

22 TB drives are around $230 on eBay, so 15 of them in raidz2 would be around $3500 (maybe a little over $4k with the rest of the server). That is around the cost of a new mirrorless camera and a decent lens, so certainly within the realm of a hobbyist. You probably couldn't get away with downloading 250 TB in any reasonable timeframe with most US ISPs (or at least Comcast) though. That'd be over 2.5 months of 300 Mb/s non-stop. Even copying it from a friend using 2.5 Gbit/s Ethernet would take over a week.
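
The numbers above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal terabytes, a link that runs at its full rated speed with no protocol overhead, and it ignores ZFS metadata and padding when estimating usable raidz2 space. The helper names are hypothetical.

```python
TB = 10**12  # decimal terabyte in bytes (assumption; drives are sold this way)

def transfer_days(size_tb: float, link_bits_per_s: float) -> float:
    """Days needed to move size_tb terabytes over a fully saturated link."""
    seconds = size_tb * TB * 8 / link_bits_per_s
    return seconds / 86_400

def raidz2_usable_tb(drives: int, drive_tb: float) -> float:
    """Rough usable capacity: raidz2 spends two drives' worth of space on parity."""
    return (drives - 2) * drive_tb

drives, drive_tb, drive_price = 15, 22, 230
print(f"drive cost: ${drives * drive_price}")
print(f"usable before ZFS overhead: {raidz2_usable_tb(drives, drive_tb)} TB")
print(f"300 Mb/s:  {transfer_days(250, 300e6):.1f} days")
print(f"2.5 Gb/s:  {transfer_days(250, 2.5e9):.1f} days")
```

With these inputs the pool comes out a bit larger than 250 TB, the 300 Mb/s download takes roughly 77 days, and the 2.5 Gbit/s copy takes a little over 9 days, which matches the estimates in the comment.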

userbinator 710 days ago

Tape might be a better choice with that much data.
