| HN Mirror



	by dooglius 2390 days ago
	There is a huge amount of duplication there (e.g. books that have many scans). I wonder if it would be better to tackle that rather than doing a straight backup.

3 comments

legatus 2390 days ago

There are groups behind data curation as well, though it is much harder. LibGen sees an addition rate of about 230 GB per month, while SciMag's is around 1.10 TB per month. We should expect those numbers to increase in the future. The man-hours required to curate those databases may very well cost much more than the storage and bandwidth required to store duplicates and incorrectly tagged files. In any case, as I said, there are people seriously interested in curating the LibGen database, though most efforts I know of are still in the earliest stages.
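To put those monthly rates in perspective, here is a rough back-of-the-envelope sketch. It assumes the rates stay constant (the comment above expects them to grow, so treat the result as a lower bound) and uses decimal units (1 TB = 1000 GB); the variable and function names are only illustrative.

```python
# Monthly addition rates quoted in the comment above.
# Assumption: decimal units, 1 TB = 1000 GB.
LIBGEN_GB_PER_MONTH = 230
SCIMAG_GB_PER_MONTH = 1100  # 1.10 TB per month


def yearly_growth_tb(gb_per_month: float) -> float:
    """Convert a monthly rate in GB into a yearly rate in TB."""
    return gb_per_month * 12 / 1000


libgen_tb = yearly_growth_tb(LIBGEN_GB_PER_MONTH)
scimag_tb = yearly_growth_tb(SCIMAG_GB_PER_MONTH)
total_tb = libgen_tb + scimag_tb

print(f"LibGen: {libgen_tb:.2f} TB/year")
print(f"SciMag: {scimag_tb:.2f} TB/year")
print(f"Total:  {total_tb:.2f} TB/year")
```

At constant rates that works out to roughly 16 TB of new data per year across both collections.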

agumonkey 2390 days ago

Do you know if they process PDFs to reduce file size?

guidoism 2390 days ago

A lot of the data is in the DjVu format, which is very efficient for scanned books.

Mediterraneo10 2390 days ago

This is a downside of LibGen: duplicate uploads and missing or erroneous metadata. You start wishing that there was at least some curation of the collection, so it could approach the quality of an academic library catalogue that many users are used to. But I guess the people behind LibGen want to keep the number of people with database edit rights small. (When you upload a book, you yourself can edit the metadata for that book for 24 hours, but you cannot go through the rest of LibGen's database and make corrections.)

jplayer01 2390 days ago

Maybe they should consider a system where users can suggest tags/metadata or flag erroneous data that can be reviewed and allowed by a select few?
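A minimal sketch of what such a suggest-then-review flow could look like. Everything here is hypothetical (the `Suggestion` and `Catalogue` classes, the field names, the in-memory storage); it is not how LibGen works, just one way to keep edit rights with a small group while letting anyone propose fixes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Suggestion:
    """A proposed metadata change waiting for review (hypothetical model)."""
    book_id: int
    field_name: str
    proposed_value: str
    submitted_by: str
    status: str = "pending"  # becomes "approved" or "rejected" after review


@dataclass
class Catalogue:
    """In-memory stand-in for a metadata database with a review queue."""
    records: Dict[int, Dict[str, str]]
    reviewers: Set[str]
    queue: List[Suggestion] = field(default_factory=list)

    def suggest(self, book_id: int, field_name: str,
                proposed_value: str, user: str) -> Suggestion:
        # Any user may propose a change; nothing is written to records yet.
        suggestion = Suggestion(book_id, field_name, proposed_value, user)
        self.queue.append(suggestion)
        return suggestion

    def review(self, suggestion: Suggestion, reviewer: str, approve: bool) -> None:
        # Only the select few with review rights may apply or reject changes.
        if reviewer not in self.reviewers:
            raise PermissionError(f"{reviewer} has no review rights")
        if suggestion.status != "pending":
            raise ValueError("suggestion was already reviewed")
        if approve:
            self.records[suggestion.book_id][suggestion.field_name] = (
                suggestion.proposed_value
            )
            suggestion.status = "approved"
        else:
            suggestion.status = "rejected"


catalogue = Catalogue(records={1: {"title": "Untitled scan"}}, reviewers={"editor"})
fix = catalogue.suggest(1, "title", "Corrected Title", user="reader")
catalogue.review(fix, reviewer="editor", approve=True)
print(catalogue.records[1]["title"], fix.status)
```

The point of the design is that the record only changes inside `review`, so the number of people who can actually alter the database stays as small as it is today.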

Avamander 2390 days ago

Integration with BookBrainz would be nice. The Brainz projects already consist of massive amounts of metadata curation and it would be possible to transfer that knowledge a bit.

Invictus0 2390 days ago

I think the duplication issue is probably overstated. I doubt tackling that would shave off more than 20% of the total backup size.

dooglius 2390 days ago

Speaking from personal experience, I usually see several results for any search. Granted, there's a big selection bias there, but 20% seems way too small.

abdullahkhalids 2390 days ago

That's because you, or anyone, are most likely to search for relatively popular books, so those books will have multiple copies. But for every popular book, there are many unpopular but still useful books that only have a single copy.
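The selection-bias argument above can be illustrated with a toy calculation. All numbers are made up for illustration (they are not LibGen statistics), and the sketch assumes every file is the same size, so savings are counted in files rather than bytes.

```python
# Hypothetical collection: a few popular titles with several copies each,
# and many obscure titles with a single copy. "search_share" is the assumed
# fraction of all searches that land on each group.
groups = {
    "popular": {"titles": 100, "copies": 4, "search_share": 0.9},
    "obscure": {"titles": 2000, "copies": 1, "search_share": 0.1},
}

total_files = sum(g["titles"] * g["copies"] for g in groups.values())

# A file is redundant if it is the 2nd, 3rd, ... copy of the same title.
redundant_files = sum(g["titles"] * (g["copies"] - 1) for g in groups.values())
dedup_savings = redundant_files / total_files

# Share of searches whose result list shows more than one copy.
searches_with_duplicates = sum(
    g["search_share"] for g in groups.values() if g["copies"] > 1
)

print(f"Files removable by deduplication: {dedup_savings:.1%}")
print(f"Searches that show duplicates:    {searches_with_duplicates:.0%}")
```

With these assumed numbers, 90% of searches turn up duplicates even though deduplication would remove only 12.5% of the files, which is how both impressions in this thread can be true at once.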

sgillen 2390 days ago

To be fair, for textbooks at least, I often see several results, but often of different editions (1x edition 1, 2x edition 2, 1x edition 3, etc.). In some cases I think it's worthwhile keeping the different editions around, unless it becomes a huge burden.

MiroF 2390 days ago

Usually the different results have meaningful differences, often a different edition, translator, etc.

asdff 2390 days ago

In my experience it's different editions or mirrors.

throwaway894345 2390 days ago

It's probably more of a nuisance for people wanting to use the content. E.g., copies with different metadata or tags.

driverdan 2390 days ago

20% is not insignificant.

roland00 2390 days ago

Forking LibGen to save 20% of the total size would be counterproductive. Yes, you save some storage, but the network effect is more important: because people are willing to contribute to "the one true thing", a single collection attracts more seeders than a 20% size reduction would make up for.