| HN Mirror


by jandrese 1840 days ago

Millions of individual torrents is not a great solution. Keeping them all seeded is basically impossible unless they run a seed for each one, at which point they might as well just host the files. Plus you'll never get the economy of scale that makes BitTorrent really shine.

When you have a whole lot of tiny files that people will generally only want one or two of there isn't much better than a plain old website.

A torrent that hosts all of the papers could be useful for people who want to make sure the data can't be lost by a single police raid.

8 comments

dredmorbius 1840 days ago

What documents (books, scientific articles) benefit from specifically is a number of highly consistent, highly accurate identifiers: DOI (scientific articles), ISBN (published books), and others (OCLC identifier, Library of Congress Catalogue Number, etc.)

With the addition of hashsums (even MD5 and SHA1, though longer and more robust hashsums are preferred), a pretty reliable archive of content can be made. It's a curious case where increased legibility seems to be breaking rather than creating a gatekeeper monopoly.

I've been interested in the notion of more reliable content-based identifiers or fingerprints themselves, though I've found little reliable reference on this. Ngram tuples of 4-5 words are often sufficient to identify a work, particularly if a selection of several are made. Agreeing on which tuples to use, how many, and how to account for potential noise / variations (special characters, whitespace variance, OCR inaccuracy) is also a stumbling point.
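
A minimal sketch of the n-gram idea above, in Python. The normalisation rules (lowercasing, accent stripping, dropping punctuation), the 5-word window, and keeping the k smallest hashes as the "selection of several" are all assumptions made for illustration, not an agreed scheme; the function names are hypothetical.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, strip accents and punctuation, and split into words."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"[a-z0-9]+", no_accents.lower())


def shingles(text: str, n: int = 5) -> set[str]:
    """Return every run of n consecutive words, joined by single spaces."""
    words = normalize(text)
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def fingerprint(text: str, n: int = 5, k: int = 8) -> set[str]:
    """Keep the k shingles whose SHA-256 digests sort lowest (an assumed selection rule)."""
    digests = sorted(hashlib.sha256(s.encode("utf-8")).hexdigest() for s in shingles(text, n))
    return set(digests[:k])


def similarity(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity of the two texts' shingle sets (0.0 if both are empty)."""
    sa, sb = shingles(a, n), shingles(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

Two copies that differ only in case, spacing, or punctuation get the same fingerprint here. OCR errors would still break individual tuples, which is why comparing overlap (`similarity`) rather than demanding an exact match seems the safer design.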

spicybright 1840 days ago

Why map anything to words for strict identification? Words and language are very error prone vs an id number or hash.

dredmorbius 1840 days ago

It's a bit of an itch I've been scratching for a few years.

Most especially, given two or more instances of what you suspect to be the same or a substantively similar work, how can you assess this in a robust and format-independent manner, programmatically?

For works with well-formed metadata, this isn't an issue.

For identical duplicate copies of the same file, a hash is effective.
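
As a rough illustration of that case, here is a sketch using Python's standard `hashlib`. The helper name `file_sha256` and the 1 MiB chunk size are my own choices, not anything prescribed.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Byte-identical copies produce the same digest, but changing a single byte (a different line ending, say) produces an unrelated one, so this only answers the "identical duplicate" question.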

But for the circumstance most often encountered in reality --- different forms and formats derived from different sources but containing substantially the same work --- there is no simple solution of which I'm aware. As examples, say you have a reference document The Reference Document.

How do I determine that:

- An ASCII-only textfile

- Markdown, HTML, DocBook, and LaTeX sources

- PDF, MS Word (which version), PS, DJVU, ePub, or .mobi files (sling any other formats you care to mention).

- Hardbound and paperback physical copies

- Scans made from the same or different physical books or instances, versions, and/or translations.

- Audiobooks based on a work. By the same or different readers.

- Dramatic performances, films, video series, comic-book adaptations, etc., of a work. (Say: Hamlet or Romeo and Juliet. What is the relationship to "West Side Story" (and which version), or Pyramus and Thisbe?)

- Re-typed or OCRed text

... all refer to the same work?

How do you define "work"?

How do you define "differences between works"?

How do you distinguish intentional, accidental, and incidental differences between instances? (Say: translations, errata, corrections, additions for the one, transcription errors for the second, and scanning or rendering artefacts for the third.)

If you're working in an environment in which instances of works come from different sources with different provenances, these questions arise. At least some of these questions are prominent in library science itself. It's the technical mapping of digitised formats I'm focusing on most closely, so the physical instantiations aren't as critical here, though the presumption is that these could be converted to some machine-readable form.

In bibliographic / library science, the term is "work, expression, manifestation"

https://www.loc.gov/marc/marbi/2011/2011-dp03.html

jl6 1840 days ago

The general problem here is not solvable with technology if there is no universally agreed definition for “a work” - and there isn’t (this touches on some profound issues of ontology).

And so I suspect the way forward is to maintain human-curated mappings of file hashes to “works”, where “a work” is a matter of the curator’s opinion, and different curations will be valued differently by different consumers of that information. For example, maybe a literary expert would have a great and respected method of organizing the works and derived works of Shakespeare, but that same person might not be sought out for their views on pop songs.
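
One way such competing curations might be represented, as a sketch only: each curator keeps their own hash-to-work table, and consumers choose whose table to trust. The class and method names are hypothetical.

```python
from typing import Optional


class Curation:
    """One curator's opinion about which file hashes belong to which work."""

    def __init__(self, curator: str) -> None:
        self.curator = curator
        self._work_of = {}  # file hash (hex string) -> work label chosen by this curator

    def assign(self, file_hash: str, work: str) -> None:
        """Record that, in this curator's view, the file is an instance of the work."""
        self._work_of[file_hash] = work

    def work_of(self, file_hash: str) -> Optional[str]:
        """Return the work label for a hash, or None if the curator has not classified it."""
        return self._work_of.get(file_hash)

    def same_work(self, hash_a: str, hash_b: str) -> Optional[bool]:
        """True or False if both files are classified, None if either is unknown."""
        a, b = self.work_of(hash_a), self.work_of(hash_b)
        if a is None or b is None:
            return None
        return a == b
```

Two curators can then disagree about the same pair of hashes without either table being "wrong", which matches the point that a work is a matter of the curator's opinion.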

You could probably start with an ML-based curation that gets it 80% right, and fill out the last 20% with gamified crowdsourcing (with multiple competing interpretations of the last 20%).

dredmorbius 1840 days ago

Yes, it's complicated.

All analogies melt if they're pushed loudly enough. And all models are wrong, though some are useful.

The notion of a work has utility: it respects the notion of different forms, variations, and evolution with time. If you're looking at, say, multiple editions of a book, or even of something much more dynamic, say, source code or a Wiki entry, yes the specific content may change at any point, and stands through many versions, but those are connected through edit events. A good revision control system will capture much of that, if the history interests you.

Ultimately, I'd argue that "work" is defined in relationships and behaviours. A record intermediates between author(s) and reader(s) (or listeners, viewers, etc.), concerning some informational phenomenon, perhaps fictional, perhaps factual, perhaps itself an action (as in a marriage record, divorce decree, or court decision). The work in its total context matters. (At which point we discover most works have very little context...).

The file-hashes-to-work mapping is all but certain to play a large role, but even that is only a means of indicating a relationship that is established by some other means.

The notion of selecting an arbitrary set of ngram tuples to establish highly probable relationship is likely to remain at least one of those means.

And yes, the incremental / tuned approach is also likely a useful notion.

Paul Otlet had a lot to say about "documents", though I think "records" is a better term for what he had in mind, as any persistent symbolic artefact: book, painting, music, photograph, film, etc.

mathnmusic 1840 days ago

I have been dealing with the same problem for curating resources at https://learnawesome.org. Projects like Openlibrary do collect unique identifiers for _books_, but for everything else, it mostly takes manual effort. For example, I collect talks/podcasts by the author where they discuss ideas from their books. Then there are summaries written by others.

dredmorbius 1840 days ago

There's a lot of work toward this in library space, though it takes some adaptation to new media formats. Paul Otlet worked in a paper-only medium in the early 20th century but also has some excellent thinking. His books are now seeing translation from French. The Internet Archive and Library of Congress are also doing a lot of relevant work, see the WARC format as an example.

What's particularly relevant now are ephemeral and/or continuously updated online content --- and not just the WWW (http/https), but other protocols (ftp, gemini, ipfs, torrents, ...), as well as apps.

A working truism I developed was that "identity is search that produces a single result". So if you can come up with something that uniquely identifies a work, then that can be a working identifier. I typically focus on what can be reasonably assessed of author, title, publication date, publisher (traditional, website/domain), and failing that, descriptive text. Remember that originally titles were simply the introductory lines of works (a practice that remains used in some cases, e.g., the names of church masses or prayers, e.g., "Kyrie Eleison").
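
That truism can be put as a small test, sketched here in Python over an in-memory catalogue. The field names and the exact-match, case-insensitive comparison are assumptions for illustration; a real catalogue search would be fuzzier.

```python
def matches(record: dict, criteria: dict) -> bool:
    """True if every criterion equals the record's field, ignoring case."""
    return all(
        str(record.get(field, "")).casefold() == str(value).casefold()
        for field, value in criteria.items()
    )


def is_identifier(catalogue: list, **criteria) -> bool:
    """A set of criteria works as an identifier only if it selects exactly one record."""
    return sum(1 for record in catalogue if matches(record, criteria)) == 1
```

A title shared by two editions fails the test on its own, and adding the publication year may be enough to make it pass.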

The Superintendent of Documents (SuDoc) Classification Scheme (used by the US government and GPO) operates by agency, type of publication, and further divisions, as well as date/year. https://www.fdlp.gov/about-fdlp/22-services/929-sudoc-classi...

mandelken 1840 days ago

Probably because for written text the words identify the content while the hash relates more to the digital carrier format (pdf vs epub) and id number can change between publications, countries, etc.

dredmorbius 1840 days ago

Bingo.

And to drag in metadata, it may:

- Not be present.

- Be inaccurately applied to the correct work (metadata say the work is different, work is in fact related/same).

- Be inaccurately applied to the wrong work (metadata say the works are the same/related, they are not).

fnord77 1840 days ago

text to speech the doc then an acoustic fingerprint on the audio :)

dredmorbius 1839 days ago

You'd all but certainly be better going in the other direction.

Text is a more constrained state space than speech/audio.

contravariant 1840 days ago

There was that project some guy posted a while back that used a combination of sqlite and partial downloads to enable searches on a database before it was downloaded all the way. If you can fit PDFs somewhere into that you'd be golden.
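
For a sense of why partial downloads can work with SQLite: the file is a sequence of fixed-size pages, so a reader only needs the byte ranges of the pages a query touches. The sketch below assumes the transport is HTTP range requests, which the comment above does not specify; the function names are hypothetical. The header layout it relies on (magic string, then a big-endian 16-bit page size at offset 16, where the value 1 means 65536) is from the SQLite file format.

```python
import struct

SQLITE_MAGIC = b"SQLite format 3\x00"


def page_size_from_header(header: bytes) -> int:
    """Read the page size from the first bytes of a SQLite database file."""
    if header[:16] != SQLITE_MAGIC:
        raise ValueError("not a SQLite 3 database header")
    (raw,) = struct.unpack(">H", header[16:18])
    return 65536 if raw == 1 else raw


def range_header_for_page(page_number: int, page_size: int) -> str:
    """Build the HTTP Range header value covering one page (pages are numbered from 1)."""
    if page_number < 1:
        raise ValueError("SQLite page numbers start at 1")
    start = (page_number - 1) * page_size
    return f"bytes={start}-{start + page_size - 1}"
```

A client would fetch the first 100 bytes, learn the page size, and then request only the index and table pages its query walks.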

Or just use IPFS I suppose.

divbzero 1840 days ago

IPFS would face a similar challenge as the “keep torrents seeded” problem mentioned by GP. Wouldn’t there be risk to peers who host the PDFs?

Natsu 1840 days ago

I sort of feel like there should be some way to use some kind of construct to get people to seed things so that others seed things for them, but I haven't seen that invented yet.

miloignis 1840 days ago

Been a while since I've looked at them, but IPFS with FileCoin and Ethereum Swarm had that kind of goal.

It might be beneficial to create something like what you describe without any cryptocurrency association though, and I've been mulling over possibilities for distributed systems that are inherently currency-less to avoid all of the scams that cryptocurrency attracts.

Taek 1840 days ago

The leader in that space is Skynet, which basically is like IPFS + Filecoin but also has dynamic elements to it, and a lot better performance + reliability.

Cryptocurrency is helpful because it allows you to incentivize people to hold the data. If you don't have cryptocurrency, you're basically dependent on altruism to keep data alive (like bittorrent, or ipfs without filecoin). Popular files do okay for a while, but even popular files struggle to provide good uptime and reliability after a few years.

On an incentivized network like Sia or Filecoin, you can have high performance and reliability without ever needing any users to keep their machines online and seeding.

lobocinza 1840 days ago

Does it scale well? SciHub is at least 100TB.

zolland 1840 days ago

I think seed ratios and seed time (mostly used by private trackers) attempt to solve this problem.

zolland 1840 days ago

What kind of risk?

BelenusMordred 1840 days ago

IPFS is not anonymous and like other p2p protocols shares your ip address. People seeding articles would get legal notices just like torrents now.

There's been a bit of effort to get it working over tor for years now but the fundamental design makes this difficult. Also despite all the money that has poured into filecoin this doesn't seem to be a priority.

This issue is nearly 6 years old:

https://github.com/ipfs/notes/issues/37

divbzero 1840 days ago

I was thinking legal risk. In this case the publishers are going after Sci-Hub, but in the past they have gone after individuals.

o8r3oFTZPE 1840 days ago

"There was that project some guy posted a while back that used a combination of sqlite and partial downloads to enable searches on a database before it was downloaded all the way."

https://github.com/bittorrent/sqltorrent

jagged-chisel 1840 days ago

this is the one: https://phiresky.github.io/blog/2021/hosting-sqlite-database...

HN submission: https://news.ycombinator.com/item?id=27016630

o8r3oFTZPE 1840 days ago

This is the original. Then came https://github.com/lmatteis/net-torrent and later one written in Javascript, inspired by net-torrent.

hkt 1840 days ago

Isn't that essentially mapreduce? Either way, interesting and I'd love to see the link.

ric2b 1835 days ago

It's this one: https://phiresky.github.io/blog/2021/hosting-sqlite-database...

vorticalbox 1840 days ago

I believe this is the project mentioned

https://github.com/lmatteis/torrent-net

contravariant 1840 days ago

That one looks familiar. Though apparently the same thing has been tried in several different ways going by the replies I got.

tmkadamcz 1840 days ago

This looks like it could be a good approach.

posterboy 1840 days ago

a plain old website or a publishing house with distribution services and syndication attached, but for a sane price.

"a whole lot of tiny files" severely underestimates the scale at work. Libgen's coverage is relatively shallow, and pdf books tend to be huge, at least for older material. Scihub piggy backs on the publishers, so that's your reference.

syndication, syndicate, quite apt don't you think? Libraries that colluded with the publishers and accepted the pricing must have been a huge part of the problem, at least historically. Now you know there's only one way out of a mafia.

jandrese 1840 days ago

In Internet scale it's not a lot of data. Most people who think they have big data don't.

Estimates I've seen put the total Scihub cache at 85 million articles totaling 77TB. That's a single 2U server with room to spare. The hardest part is indexing and search, but it's a pretty small search space by Internet standards.

andyxor 1840 days ago

The entire archive actually fits in a small desktop NAS (e.g. QNAP or Synology) with a few 14-18TB drives, you don't even need a server rack.

There is an existing index in sql format distributed by libgen: https://www.reddit.com/r/scihub/comments/nh5dbu/a_brief_intr..., it is around 30GB uncompressed.

Those 851 torrents uncompressed would probably take half a petabyte of storage, but I guess for serving pdfs you could extract individual files on demand from zip archive and (optionally) cache them. So the scihub "mirror" could run on a workstation or even laptop with 32-64GB memory connected to 100TB NAS over 1GBE, serving pdfs over VPN and using unlimited traffic plan. The whole setup including workstation, NAS and drives would cost $5-7K.
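
The "extract on demand and cache" step could look roughly like this with Python's standard `zipfile`. The function name, the in-memory `lru_cache`, and its size of 1024 entries are assumptions; a real mirror would more likely cache to disk.

```python
import zipfile
from functools import lru_cache


@lru_cache(maxsize=1024)
def read_member(archive_path: str, member: str) -> bytes:
    """Return one file's bytes from a zip archive without unpacking the rest.

    Results are cached per (archive_path, member), so repeat requests for a
    popular PDF skip the archive entirely. Raises KeyError if the member is missing.
    """
    with zipfile.ZipFile(archive_path) as archive:
        return archive.read(member)
```

Because zip keeps a central directory, reading one member does not require scanning or decompressing the others, which is what makes leaving the torrents' archives packed on the NAS plausible.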

It's not a very difficult project and can be done DIY style, if you exclude the proxy part (which downloads papers using donated credentials). Of course it would still be as risky as running Scihub itself, which has a $15M lawsuit pending against it.

dredmorbius 1840 days ago

The entire Library of Congress books collection is on the order of 40 million items.

At 5 MB per book, this works out to about 200 TB of disk storage.

At about $12/TB, hosting the entire LoC collection would cost roughly $2,400 presently, with prices halving about every three years.

dredmorbius 1840 days ago

Note that $2,400 is disks alone. You'd obviously need chassis, power supplies, and racks. Though that's only 17 12 TB drives.
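
The arithmetic in this and the parent comment can be checked with a few lines of Python. The inputs (40 million items, 5 MB each, $12/TB, 12 TB drives, prices halving every three years) are the figures quoted in the thread; decimal units (1 TB = 1,000,000 MB) and the function names are my assumptions.

```python
import math


def collection_tb(items: int, mb_per_item: float) -> float:
    """Total size in terabytes, using decimal units (1 TB = 1,000,000 MB)."""
    return items * mb_per_item / 1_000_000


def disk_cost_usd(total_tb: float, usd_per_tb: float) -> float:
    """Raw disk cost, ignoring chassis, power, and redundancy."""
    return total_tb * usd_per_tb


def drives_needed(total_tb: float, drive_tb: float) -> int:
    """Number of whole drives required to hold the collection once."""
    return math.ceil(total_tb / drive_tb)


def projected_cost_usd(cost_now: float, years: float, halving_years: float = 3.0) -> float:
    """Cost after a number of years if prices halve every halving_years."""
    return cost_now * 0.5 ** (years / halving_years)


total = collection_tb(40_000_000, 5)   # 200.0 TB
cost = disk_cost_usd(total, 12)        # 2400.0 USD
drives = drives_needed(total, 12)      # 17 drives
```

Triple redundancy would simply multiply the drive count and disk cost by three.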

Factor in redundancy (I'd like to see a triple-redundant storage on any given site, though since sites are redundant across each other, this might be forgoable). Access time and high demand are likely the big factors, though caching helps tremendously.

My point is that the budget is small and rapidly getting smaller. For one of the largest collections of written human knowledge.

There are some other considerations:

- If original typography and marginalia are significant, full-page scans are necessary. There's some presumption of that built into my 5 MB/book figure. I've yet to find a scanned book of > 200 MB (the largest I've seen is a scan of Charles Lyell's geology text, from Archive.org, at north of 100 MB), though there are graphics-heavy documents which can run larger.

- Access bandwidth may be a concern.

- There's a larger set of books ever published, with Google's estimate circa 2014 being about 140 million books.

- There are ~300k "conventionally published" books in English annually, and about 1-2 million "nontraditional" (largely self-published), via Bowker, the US issuer of ISBNs.

- LoC have data on other media types, and their own complete collection is in the realm of 140 million catalogued items (coinciding with Google's alternate estimate of total books, but unrelated). That includes unpublished manuscripts, maps, audio recordings, video, and other materials. The LoC website has an overview of holdings.

Published document scarcity is entirely imposed.

HWR_14 1840 days ago

It still amazes me that 77TB is considered "small". Isn't that still in the $500-$1,000 range of non-redundant storage? Or if hosted on AWS, isn't that almost $1,900 a month if no one accesses it?

I know it's not Big Data(tm) big data, but it is a lot of data for something that can generate no revenue.

smichel17 1840 days ago

> Isn't that still in the $500-$1,000 range of non-redundant storage?

Sure. Let's add redundancy and bump by an order of magnitude to give some headroom -- $5-10k is a totally reasonable amount to fundraise for this sort of application. If it were legal, I'm sure any number of universities would happily shoulder that cost. It's minuscule compared to what they're paying Elsevier each year.

HWR_14 1840 days ago

Sorry. My point was it was a lot of money precisely because it cannot legally exist. If it could collect donations via a commercial payment processor, it could raise that much money from end users easily. Or grants from institutions. But in this case it seems like it has to be self-funded.

pbhjpbhj 1840 days ago

I'm prepared to accept "does generate no revenue" but "can generate no revenue" ...?

Perhaps some sort of MTurk or captcha-like tasks per access? Patr[e]ons? Donation drives? Micro-payments? Something else??

HWR_14 1840 days ago

Oh, it could generate revenue if it was legal. But it is not, so it seems difficult.

dredmorbius 1840 days ago

For an institution, it's a rounding error.

AWS is not the cheapest bulk-storage hosting possible.

matthewdgreen 1839 days ago

Google already does a pretty good job with search. Sci-Hub really just needs to handle content delivery, instead of kicking you to a scientific publisher's paywall.

einpoklum 1840 days ago

If the sane price is an optional "Donate to keep this site going" link, then ok. But only free access, without authentication or payment, to scientific papers, is sane. IMHO.

munk-a 1840 days ago

Might this be a case where the best resolution would be to have the government (which is at least partially funding nearly all of these papers) step in and add a ledger of papers as a proof of investment?

The cost of maintaining a free and open DB of scientific advances and publications would be so incredibly insignificant compared to both the value and the continued investment in those advancements.

jpeloquin 1840 days ago

> Might this be a case where the best resolution would be to have the government (which is at least partially funding nearly all of these papers) step in and add a ledger of papers as a proof of investment?

I feel that we're halfway there already and are gaining ground. Does Pubmed Central [0] (a government-hosted open access repository for NIH-funded work) count as a "ledger" like you're referring to? The NSF's site does a good job of explaining current US open access policy [1]. There are occasional attempts to expand the open access mandate by legislation, such as FASTR [2]. A hypothetical expansion of the open access mandate to apply to all works from /institutions/ that receive indirect costs, not just individual projects that receive direct costs, would open things up even more.

[0] https://www.ncbi.nlm.nih.gov/pmc/

[1] https://www.nsf.gov/pubs/2016/nsf16009/nsf16009.jsp#q1

[2] https://sparcopen.org/our-work/fastr/

einpoklum 1840 days ago

Well, some research venues (and publication venues) are not government-funded, and even if they are indirectly government funded, it's more of a sophistry than something which would make publishers hand over copies of the papers.

Also, a per-government ledger would not be super-practicable. But if, say, the US, the EU and China would agree on something like this, and implement it, and have a common ledger, then it would not be such a big leap to make it properly international. Maybe even UN-based.

That's a pretty big "if" though.

posterboy 1833 days ago

I share the sentiment insofar as free access would benefit my own sanity, except when it is about hoarding.

On the other hand, there is a slippery slope to decide what isn't scientific so much as to not be required open knowledge.

By the way, specialist knowledge and open knowledge is kind of a dichotomy. You would need to define the intersection of both. Suddenly you are looking at a patent system. Pay to quote, citation fees: news websites are already demanding this from Google here in Germany, including Springer Press.

whimsicalism 1840 days ago

Libgen's coverage is definitely more shallow than scihub, but it is still pretty good.

tmkadamcz 1840 days ago

There are already torrents of the archives. But supposing scihub was taken down, it's pretty non-trivial to get from the archive back to a working site with search functionality. For one thing, none of Sci-Hub's code is available.

derefr 1840 days ago

Seems like what should be in each torrent is a virtual appliance preloaded with one shard of the data, where that virtual appliance has a pre-baked index for searching just that shard's data. Then one more torrent for a small search-coordinator appliance that fans your search query out to all N shard appliances.
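
A toy version of that layout, as a sketch: each shard carries a pre-built inverted index over its own documents, and a coordinator fans the query out and merges the answers. Everything here runs in one process with threads standing in for separate appliances; the class and function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """One shard's documents plus a pre-baked term index covering only that shard."""

    def __init__(self, name: str, documents: dict) -> None:
        self.name = name
        self.index = defaultdict(set)  # term -> ids of documents containing it
        for doc_id, text in documents.items():
            for term in text.lower().split():
                self.index[term].add(doc_id)

    def search(self, query: str) -> set:
        """Return ids of documents in this shard that contain every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        return set.intersection(*(self.index.get(term, set()) for term in terms))


def fan_out(shards: list, query: str) -> list:
    """Send the query to every shard in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partial_results = list(pool.map(lambda shard: shard.search(query), shards))
    return sorted(set().union(*partial_results))
```

The coordinator holds no data of its own, so it stays small; the cost is that every query touches every shard.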

dpacmittal 1840 days ago

BitTorrent does allow you to download a single file from a torrent though. You could have torrent for each month, and a client which allows you to search inside these torrents and download only the files you need.
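
Selective download works because a multi-file torrent treats its files as one concatenated byte stream cut into fixed-size pieces, so a client can work out which pieces cover a given file. A sketch of that calculation, with a hypothetical function name:

```python
def pieces_for_file(file_lengths: list, file_index: int, piece_length: int) -> range:
    """Return the piece indices that overlap one file in a multi-file torrent.

    file_lengths lists every file's size in bytes, in the torrent's file order.
    """
    length = file_lengths[file_index]
    if length == 0:
        return range(0)  # an empty file occupies no pieces
    start = sum(file_lengths[:file_index])   # byte offset where the file begins
    end = start + length                     # exclusive end offset
    return range(start // piece_length, (end - 1) // piece_length + 1)
```

Note that the first and last pieces are usually shared with neighbouring files, so a client fetching one paper ends up downloading a little of the papers next to it.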

mark-wagner 1840 days ago

Maybe Usenet? It already supports massive copyright infringement yet it is still around.

sildur 1840 days ago

Maybe we can create a freesite, on Freenet.

fabioyy 1840 days ago

If only it were possible to use chia to store content... it would be a game changer