Look at the Google Books project. It got shut down hard over copyright issues and litigation, after Google had invested a ton of money in digitizing some of the most valuable library collections in the world.
> Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.
Indeed, what an intellectual tragedy..
> In August 2010, Google put out a blog post announcing that there were 129,864,880 books in the world. The company said they were going to scan them all.
That seems like a surprisingly "small" number.
Well, in trying to picture a physical library with 130 million books, maybe that's a realistic estimate. But compared to, say, the recently discovered data hoard of more than 2 billion online identities, it's minuscule.
SciHub and LibGen are truly the modern-day Library of Alexandria. The fact that they're being called "Pirate Bays of Science" - and that providing free and open access to all books in the world is illegal - just goes to show that our civilization's priorities are misdirected.
Until fairly recently (historically), books were overwhelmingly scarce. A few datapoints:
- The total number of books -- not titles, but actual bound volumes -- in Europe as of 1500 CE was about 50,000. By 1800, the total was just under one billion.
- The library of the University of Paris circa 1000 CE comprised about 2,000 volumes. It was among the largest in Europe.
- The Library of Constantinople in the 5th century had 120,000 volumes, the largest in Europe at the time.
- A fair-sized city public library today has on the order of 300,000 volumes. A large university library generally holds a million or so. The Harvard Library contains 20 million volumes. The University of California collection, across all ten campuses, totals more than 34 million volumes.
- The total surviving corpus of Greek literature is a few hundred titles. I believe many of those were only preserved through Arabic scholars, some possibly in Arabic translation, not the original Greek.
- There's an online collection of cuneiform tablets. These generally correspond to a written page (or less) of text, with the largest collections numbering in the tens of thousands of items.
- As of about 1800, the library of the British Museum (now the British Library) had 50,000 volumes. Again, among the largest of its time.
- From roughly 1950 to 2000, about 300,000 titles were published annually in the United States and/or in English-language editions. R.R. Bowker issues ISBNs and tracks this. From ~2005 onward, "nontraditional" books (self- / vanity-published) have been about or above 1 million annually.
- The US Library of Congress, the largest contemporary library in the world, holds 24 million books in its main collection (another 16 million in large type), and has 126 million catalogued items in total (2015).
- At about 5 MB per book, in PDF form, total storage for the 38 million volumes of the Library of Congress would be slightly under 200 TB. At about $50/TB, that's $10,000 of raw disk storage. (Actual provisioning costs would be higher.) Costs are falling at 15%/year.
- Total data in the world comprises far more than books, and has been doubling about every 2 years. Or stated inversely: half of all the recorded information of humankind was created in the past two years.
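The storage estimate in the list above is easy to check. A minimal sketch, using the rough figures from the post (38 million volumes, about 5 MB per PDF, about $50/TB) and assuming decimal units (1 TB = 1,000,000 MB), which is my assumption rather than anything stated above:

```python
# Back-of-the-envelope check of the Library of Congress storage estimate.
# Inputs are the rough figures quoted in the post, not measured values.
VOLUMES = 38_000_000     # volumes, as quoted
MB_PER_BOOK = 5          # approximate size of one book as a PDF
USD_PER_TB = 50          # approximate raw disk price
MB_PER_TB = 1_000_000    # decimal units (assumption)


def storage_tb(volumes: int, mb_per_book: float) -> float:
    """Total storage in terabytes for the given number of volumes."""
    return volumes * mb_per_book / MB_PER_TB


def raw_disk_cost(tb: float, usd_per_tb: float) -> float:
    """Raw disk cost in dollars, ignoring redundancy and provisioning."""
    return tb * usd_per_tb


total_tb = storage_tb(VOLUMES, MB_PER_BOOK)
cost = raw_disk_cost(total_tb, USD_PER_TB)
print(f"{total_tb:.0f} TB, about ${cost:,.0f} of raw disk")
```

That comes out to 190 TB and $9,500, which matches "slightly under 200 TB" and roughly $10,000.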
Sources:
Some of this is off the top of my head, but partial support for the facts from:
Thank you for that, very interesting and educational. I love how you led up to the punchline. It made me see that books as a technology and artifact are part of the "history of information", and how books are becoming subsumed in a shared trajectory with media/data in general.
> half of all the recorded information of humankind was created in the past two years
That is shocking to imagine, and it's exponentially growing.
It reminds me of Vannevar Bush's "As We May Think", pointing out the emerging information overload in society. It certainly puts things in perspective, how we (humanity) have been making a conscious, collaborative effort to develop globally networked computers, one of whose important functions is to help us organize all the information, including books.
The conundrum, it seems, is that technology is also a massive multiplier/amplifier of the amount of data, so its capacity to help us organize may never catch up to what it's helping to produce.
> total storage for the 38 million volumes of the Library of Congress would be slightly under 200 TB
I guess it's redundant to say, but I'm sure in the near future that would fit on a thumb drive!
Bush's essay is of course a classic. There are some precursors -- there's a BBC interview of H.G. Wells describing something similar from the 1940s.[1] E.M. Forster's The Machine Stops has some similar ideas. And various encyclopaedists very much embodied similar ideals.
I've been listening to Peter Adamson's "History of Philosophy Without Any Gaps" podcast, which is excellent, and spends a fair bit of time looking at the historiography of the topic -- what works were preserved, how, various interpretations, practices, preservation, and losses. Interesting to note that most of the preserved Greek and Roman works were found in obscure Arabian monasteries and libraries. The mainstream collections themselves were often lost in raids, fires, or other mishaps. Which makes the LibGen situation all the more relevant and urgent.
(I'm a huge user of the site and others like it, for what it's worth.)
On the amount of total data being captured: there's a huge difference between quantity and quality measures of information. They're almost certainly inversely related.
Of what books were written in antiquity, up to the time of the printing press, say, odds were fairly strong that a work would be read.
At 1 million new titles being published per year, there are only 330 people in the US per book, or roughly 400 native English speakers worldwide. (With ~2 billion speakers worldwide, the total audience might reach 2,000 per book). Clearly, most of what's being written will have a very small, or no, audience.
For machine-captured data, the likelihood that any of it is seen directly by a human is vanishingly small. More of it will undergo some level of machine processing or interpretation, though even that only applies to a fairly small fraction of data. Insert old joke about the WORN drive: write once, read never.
As for storage costs (and/or size), at a 15% cost reduction per year, the price halves about every 4.27 years (roughly 4 years and 3 months), which means that in 10 years, the $10k price tag becomes about $2k, and in 20 years, it should be under $400. For the entire Library of Congress collection.
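For anyone who wants to check the compounding, here is a minimal sketch. It assumes the 15% annual decline holds steady and starts from the rough $10,000 figure above. The function names are just for illustration.

```python
import math

ANNUAL_DECLINE = 0.15   # assumed constant yearly drop in cost per TB
START_COST = 10_000     # rough cost today, in dollars


def halving_time_years(decline: float) -> float:
    """Years for the cost to halve under a constant annual decline."""
    return math.log(0.5) / math.log(1 - decline)


def cost_after(years: float, start: float = START_COST,
               decline: float = ANNUAL_DECLINE) -> float:
    """Projected cost after the given number of years."""
    return start * (1 - decline) ** years


print(f"halving time: {halving_time_years(ANNUAL_DECLINE):.2f} years")
for years in (10, 20):
    print(f"after {years} years: ${cost_after(years):,.0f}")
```

The quick rule-of-70 estimate, 70/15, gives about 4.67 years. The exact compound figure is closer to 4.27 years, with the cost at roughly $1,970 after 10 years and roughly $390 after 20.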
Flash drives seem to be increasing in capacity by a factor of 10 every 2.5 years. There are now 2 TB flash drives, so 200 TB might be as little as 5 years out. That ... still sounds optimistic to me.
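The same kind of sketch works for the flash drive guess. It assumes capacity really does grow tenfold every 2.5 years, which is an eyeballed trend and not a law:

```python
import math

GROWTH_FACTOR = 10    # assumed capacity multiplier per period
PERIOD_YEARS = 2.5    # assumed length of one period
CURRENT_TB = 2        # largest flash drive today, per the post
TARGET_TB = 200       # rough size of the Library of Congress estimate


def years_to_reach(target_tb: float, current_tb: float,
                   factor: float = GROWTH_FACTOR,
                   period: float = PERIOD_YEARS) -> float:
    """Years until capacity grows from current_tb to target_tb."""
    periods = math.log10(target_tb / current_tb) / math.log10(factor)
    return periods * period


print(f"{years_to_reach(TARGET_TB, CURRENT_TB):.1f} years")
```

A hundredfold increase is two tenfold steps, so about 5 years if the trend holds.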
While brushing up on the encyclopaedists, I found this little gem:
"Among some excellent men, there were some weak, average, and absolutely bad ones. From this mixture in the publication, we find the draft of a schoolboy next to a masterpiece." — Denis Diderot
Taking the quote out of context (and aside from its historical male-centered language) - it sure rings true of the current state of the web, as well as books.
About the inverse relationship of quantity vs quality, we seem to be drowning in quantity! As you've pointed out, there's great need for thoughtful organization and curation.
I like how you break down the quantifiable aspects to draw a historical trend and future projection. The rise of "data science" and "big data" in the past few decades really makes sense in this light.
I'm sure machine learning and "AI" will play an increasing role in the task of organizing and processing all this information, but at the bottom I feel that the most value probably comes from human curation.
LibGen has been an amazing resource for me as a lover of knowledge, a life-long bookworm. I've got bookshelves and boxes full of physical books as well, but it's a drop in the ocean..
There are also multiple petabytes of microfiche scans of old newspapers. And of course nobody cares about it. The project was shut down 2011ish and the data became "owned" by a team that didn't care for it. There was talk of just deleting the data because the team didn't want to pay for it. Ugh.
In this case the issue seems to have come from "copyright minimalists" instead: people wanting the books to be freely available, rather than making money for Google...
I wonder why the Copyright Office didn't just buy Google Books. It would only have cost a few hundred million dollars?
> Upon hearing that Google was taking millions of books out of libraries, scanning them, and returning them as if nothing had happened, authors and publishers filed suit against the company, alleging, as the authors put it simply in their initial complaint, “massive copyright infringement.”
This is where the project derailed and never quite recovered.
> As Tim Wu pointed out in a 2003 law review article, what usually becomes of these battles—what happened with piano rolls, with records, with radio, and with cable—isn’t that copyright holders squash the new technology. Instead, they cut a deal and start making money from it.
[...]
> now, in 2011, there was a plan—a plan that seemed to work equally well for everyone at the table
[...]
> DOJ’s intervention likely spelled the end of the settlement agreement. No one is quite sure why the DOJ decided to take a stand instead of remaining neutral. Dan Clancy, the Google engineering lead on the project who helped design the settlement, thinks that it was a particular brand of objector—not Google’s competitors but “sympathetic entities” you’d think would be in favor of it, like library enthusiasts, academic authors, and so on—that ultimately flipped the DOJ.
I’m fairly confident that Google Books is a huge money loser for Google. The only reason it’s still online is that there are people within Google willing to stick their necks out to spend the money on it.