Hacker News
by dooglius 2390 days ago
There is a huge amount of duplication there (i.e., books that have many scans). I wonder if it would be better to tackle that rather than doing a straight backup.
3 comments

There are groups behind data curation as well, though it is much harder. LibGen sees an addition rate of about 230 GB per month, while SciMag's is around 1.10 TB per month. We should expect those numbers to increase in the future. The man-hours required to curate those databases may very well cost much more than the storage and bandwidth required to store duplicates and incorrectly tagged files. In any case, as I said, there are people seriously interested in curating the LibGen database, though most efforts I know of are still in the earliest stages.
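As a rough back-of-the-envelope check, the sketch below annualizes the two monthly rates quoted in the comment above. It assumes decimal units (1 TB = 1000 GB) and that the rates stay flat, which the comment itself does not expect; the constant and function names are illustrative only.

```python
# Monthly addition rates quoted in the comment above, in GB.
# Assumptions: decimal units (1 TB = 1000 GB) and constant rates.
LIBGEN_GB_PER_MONTH = 230
SCIMAG_GB_PER_MONTH = 1100  # 1.10 TB


def yearly_growth_tb(gb_per_month: float) -> float:
    """Annualize a monthly rate and convert GB to TB."""
    return gb_per_month * 12 / 1000


libgen_tb = yearly_growth_tb(LIBGEN_GB_PER_MONTH)
scimag_tb = yearly_growth_tb(SCIMAG_GB_PER_MONTH)
total_tb = libgen_tb + scimag_tb

print(f"LibGen: {libgen_tb:.2f} TB/year")
print(f"SciMag: {scimag_tb:.2f} TB/year")
print(f"Total:  {total_tb:.2f} TB/year")
```

At flat rates that comes to roughly 16 TB of new data per year across both collections, which is the storage side of the curation-versus-storage trade-off.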
Do you know if they process PDFs to reduce file size?
A lot of the data is in the DjVu format, which is very efficient for scanned books.
This is a downside of LibGen: duplicate uploads, missing or erroneous metadata. You start wishing that there was at least some curation of the collection, so it could approach the quality of an academic library catalogue as many users are used to. But I guess the people behind LibGen want to keep the number of people with database edit rights small. (When you upload a book, you yourself can edit the metadata for that book for 24 hours, but you cannot go through the rest of LibGen's database and make corrections.)
Maybe they should consider a system where users can suggest tags/metadata or flag erroneous data that can be reviewed and allowed by a select few?
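One way such a suggest-then-review flow could look is sketched below. This is a hypothetical, in-memory model and not how LibGen works: the `Catalog`, `Suggestion`, and `Status` names, the sample record, and the user names are all invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Suggestion:
    """A proposed metadata correction from any user (hypothetical model)."""
    book_id: str
    field_name: str
    new_value: str
    submitted_by: str
    status: Status = Status.PENDING


@dataclass
class Catalog:
    records: dict = field(default_factory=dict)    # book_id -> {field: value}
    reviewers: set = field(default_factory=set)    # the "select few"
    queue: list = field(default_factory=list)      # all submitted suggestions

    def suggest(self, suggestion: Suggestion) -> None:
        """Any user may queue a correction; the record is not changed yet."""
        self.queue.append(suggestion)

    def review(self, reviewer: str, suggestion: Suggestion, approve: bool) -> None:
        """Only trusted reviewers may apply or reject a pending suggestion."""
        if reviewer not in self.reviewers:
            raise PermissionError(f"{reviewer} may not review suggestions")
        if suggestion.status is not Status.PENDING:
            raise ValueError("suggestion has already been reviewed")
        if approve:
            record = self.records.setdefault(suggestion.book_id, {})
            record[suggestion.field_name] = suggestion.new_value
            suggestion.status = Status.APPROVED
        else:
            suggestion.status = Status.REJECTED


# Usage with made-up data: "bob" suggests a fix, reviewer "alice" approves it.
catalog = Catalog(
    records={"b1": {"title": "Intro to Algoritms", "edition": "3"}},
    reviewers={"alice"},
)
fix = Suggestion("b1", "title", "Intro to Algorithms", submitted_by="bob")
catalog.suggest(fix)
catalog.review("alice", fix, approve=True)
print(catalog.records["b1"]["title"])
```

The point of the split is that write access stays with a small group while the work of spotting errors is spread across all users.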
Integration with BookBrainz would be nice. The Brainz projects already consist of massive amounts of metadata curation and it would be possible to transfer that knowledge a bit.
I think the duplication issue is probably overstated. I doubt tackling that would shave off more than 20% of the total backup size.
Speaking from personal experience, I usually see several results for any search. Granted, there's a big selection bias there, but 20% seems way too small.
That's because you, like anyone, are most likely to search for relatively popular books, so those books will have multiple copies. But for every popular book, there are many unpopular but still useful books that have only a single copy.
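To make the selection-bias argument concrete, here is a toy calculation. Every number in it is made up purely for illustration and says nothing about the real collection; it also counts files rather than bytes, so it ignores differences in file size.

```python
# Hypothetical catalog, chosen only to illustrate the argument.
POPULAR_TITLES = 100        # each stored as several copies
COPIES_PER_POPULAR = 4
UNPOPULAR_TITLES = 9_900    # each stored exactly once
SEARCH_SHARE_POPULAR = 0.9  # assumed fraction of searches hitting a popular title

total_files = POPULAR_TITLES * COPIES_PER_POPULAR + UNPOPULAR_TITLES
redundant_files = POPULAR_TITLES * (COPIES_PER_POPULAR - 1)

# What a typical searcher experiences: how often a search shows duplicates.
searches_with_duplicates = SEARCH_SHARE_POPULAR

# What deduplication could actually remove, as a share of all files.
redundant_share = redundant_files / total_files

print(f"Searches showing duplicates: {searches_with_duplicates:.0%}")
print(f"Files that are redundant:    {redundant_share:.1%}")
```

Under these invented numbers, nearly every search turns up duplicates even though only a few percent of the stored files are redundant, which is the gap between the two impressions in this subthread.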
To be fair, for textbooks at least, I often see several results, but often of different editions (1x edition 1, 2x edition 2, 1x edition 3, etc.). In some cases I think it's worthwhile keeping the different editions around, unless it becomes a huge burden.
Usually the different results have meaningful differences: often a different edition, translator, etc.
In my experience it's different editions or mirrors.
It's probably more of a nuisance for people wanting to use the content. E.g., copies with different metadata or tags.
20% is not insignificant.
Forking LibGen to save 20% of the file size would be counterproductive. Yes, you save some storage, but the network effect is more important: people willing to contribute to "the one true thing" actually provide more seeders than the 20% savings are worth.