Libgen is a civilizational project that should be endorsed, not prosecuted. I hope one day people will look at it and think how stupid we were today to shun the largest collection of literary works in human history.
Anna's Archive encourages (and monetizes!!) the use of their shadow library for LLM training. They have a page dedicated to it on their site. You pay them, and they give you high download speeds to entire datasets.
Libgen turns into a problem when you have a company developing generative AI with it, funneling money either to GPU manufacturers or to itself through paid services (see OpenAI).
The issue is there's an information asymmetry between buyer and seller for books, because a buyer doesn't know the contents until they buy the book. Reviews can help, but not if the reviews are fake/AI generated. In this case, these books are profitable even if only a few people buy them, as the marginal cost of creating such a book is close to zero.
This is starting to get pretty circular. The AI was trained on copyrighted data, so we can make a hypothesis that it would not exist - or would exist in a diminished state - without the copyright infringement. Now, the AI is being used to flood AI bookstores with cheaply produced books, many of which are bad, but are still competing against human authors.
> If they start out-competing humans, is that bad?
Not inherently, but it depends on what you mean by out-competing. Social media outcompeted books and now everyone's addicted and mental illness is more rampant than ever. IMO, a net negative for society. AI books may very well win out through sheer spam but is that good for us?
> Nobody has responded to me with anything about how authors are harmed
i imagine if books can be published to some e-book provider through an API to extract a few dollars per book generated (multiplied by hundreds), then eventually it'll be borderline impossible to discover an actual author's book. breaking through for newbie writers will be even harder because of all of the noise. it'll be up to providers like Amazon to limit it, but then we're reliant on the benevolence of a corporation and most act in self interest, and if that means AI slop pervading every corner of the e-book market, then that's what we'll have.
kind of reminds me of solana memecoins and how there are hundreds generated everyday because it's a simple script to launch one. memecoins/slop has certainly lowered the trust in crypto. can definitely draw some parallels here.
> Nobody has responded to me with anything about how authors are harmed
The same way good law-abiding folk are harmed when Heroin is introduced to the community. Then those people won't be able to lend you a cup of sugar, and may well start causing problems.
AI books take off and are easy to digest, and before long your user base is quite literally too stupid to buy and read your book even if they wanted.
And, for the record, it's trivial to "out compete" books or anything else. You just cheat. For AI, that means making 1000 books that lie for every one real book. Can you find a needle in a haystack? You can cheat by making things addictive, by overwhelming with supply, by straight up lying, by forcing people to use it... there's really a lot of ways to "outcompete".
> It feels more like we just want to punish people, particularly rich people, particularly if they get away with stuff we're afraid to try.
If by "afraid to try" you mean "know to be morally reprehensible" and if by "punish people" you mean "punish people (who do things that we know to be morally reprehensible)", then sure.
But... you might just be describing the backbone of human society since, I don't know, ever? Hm, maybe there's a reason we have that perspective. No, it must just be silly :P
I think the concern goes to the point of copyright to begin with, which is to incentivize people to create things. Will the inclusion of copyrighted works in LLM training (further) erode that incentive? Maybe, and I think that's a shame if so. But I also don't really think it's the primary threat to the incentive structure in publishing.
Copyright was invented by publishers (the printing guild) to ensure that the capitalists who own the printing presses could profit from artificial monopolies. It decreases the works produced, on purpose, in order to subsidize publishing.
If society decides we no longer want to subsidize publishers with artificial monopolies, we should start with legalizing human creativity. Instead we're letting computers break the law with mediocre output while continuing to keep humans from doing the same thing.
LLMs are serving as intellectual property laundering machines, funneling all the value of human creativity to a couple of capitalists. This infringement of intellectual property is just the more pure manifestation of copyright, keeping any of us from benefitting from our labor.
Few companies can amass such quantities of knowledge and leverage it all for their own, very private profits. This is an unprecedented centralization of power, for a very select few. Do we actually want that? If not, why not block this until we're sure this is a net positive for most people?
Because they expect not to have to open-source future models. It's easy to open stuff up as long as you strengthen your position and prevent the competition from emerging.
Ask Google about Android and what they now choose to release as part of AOSP vs Play Services.
…why? Will people buy less books because we have intuitive algorithms trained on old books?
Personally, I strongly believe that the aesthetic skills of humanity are one of our most advanced faculties — we are nowhere close to replacing them with fully-automated output, AGI or no.
You got less than 1% of a book... from an author who has passed away... who wrote on a research topic that was funded by an institution that takes in hundreds of millions of dollars in federal grants each year...
I'm not an author (although I do generate almost exclusively IP for a living) and I think this is about as weak a form of this argument as you could possibly make.
So right back at ya... who was hurt in your example?
I think the key is to think through the incentives for future authors.
As a thought experiment, say that the idea someday becomes mainstream that there is no reason to read any book or research publication because you can just ask an AI to describe and quote at length from the contents of anything you might want to read. In such a future, I think it's reasonable to predict that there would be less incentive to publish and thus less people publishing things.
In that case, I would argue the "hurt" is primarily to society as a whole, and also to people who might have otherwise enjoyed a career in writing.
Having said that, I don't think we're particularly close to living in that future. For one thing I'd say that the ability to receive compensation from holding a copyright doesn't seem to be the most important incentive for people to create things (written or otherwise), though it is for some people. But mostly, I just don't think this idea of chatting with an AI instead of reading things is very mainstream, maybe at least in part because it isn't very easy to get them to quote at length. What I don't know is whether this is likely to change or how quickly.
> there is no reason to read any book or research publication because you can just ask an AI to describe and quote at length from the contents of anything you might want to read
I think this is the fundamental misunderstanding at the heart of a lot of the anger over this, beyond the basic "corporations in general are out of control and living authors should earn a fair wage" points that existed before this.
You summarize well how we aren't there yet, but I'd say the answer to your final implied question is "not likely to change at all". Even when my fellow traitors-to-humanity are done with our cognitive AGI systems that employ intuitive algorithms in symphony with deliberative symbolic ones, at the end of the day, information theory holds for them just as much as it does for us. LLMs are not built to memorize knowledge, they're built to intuitively transform text -- the only way to get verbatim copies of "anything you might want to read" is fundamentally to store a copy of it. Full stop, end of story, will never not be true.
In that light, such a future seems as easy to avoid today as it was 5 years ago: not trivial, but well within the bounds of our legal and social systems. If someone makes a bot with copies of recent literature, and the authors wrote that lit under a social contract that promised them royalties, then the obvious move is to stop them.
Until then, as you say: only extremists and laymen who don't know better are using LLMs to replace published literature altogether. Everyone else knows that the UX isn't there, and the chance for confident error way too high.
that was just a metaphor, you can ask your AI what's that or use way less energy and use Wikipedia's search engine... or do you think OpenAI first evaluates if the author is an independent developer &/or has died &/or was funded by a public university before adding the content to the training database? /s
and one thing is publishing a paper with jargon for academics, another is to simplify the results for the masses. there's a huge difference between finishing a paper and a book
It isn't that someone was hurt. We have one private entity gaining power by centralizing knowledge (which they never contributed to) and making people pay for regurgitating the distilled knowledge, for profit.
Few entities can do that (I can't).
Most people are forced to work for companies that sell their work to the highest bidder (which are the very entities mentioned above), or ask them to use AI (under the condition that such work is accessible to the AI entities).
It's obviously a vicious circle if people can't object to their work being ingested and repackaged by a few AI giants.
The answer is to censor the model output, not the training input. A dumb filter using 20 year old technology can easily stop LLMs from outputting copyrighted text verbatim.
I know that this seems likely from a theoretical perspective (in other words, I would way underestimate it at the sprint planning meeting!), but
A) checking each output against a regex representing a hundred years of literature would be expensive AF no matter how streamlined you make it, and
B) latent space allows for small deviations that would still get you in trouble but are very hard to catch without a truly latent wrapper (i.e. another LLM call). A good visual example of this is the coverage early on in the Disney v. ChatGPT lawsuit:
What if the model simply substitutes synonyms here and there without changing the spirit of the material? (This might not work for poetry, obviously.) It is not such a simple matter.
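To make the disagreement above concrete, here's a rough sketch of what a "dumb filter" could look like: word n-gram (shingle) overlap against an index of protected text. The names (`shingles`, `overlap_ratio`, `PROTECTED`), the 5-word window, and the 0.5 threshold are all my own assumptions for illustration, not anything a real provider is known to use, and the sample passage is a public-domain Dickens line standing in for a copyrighted one. It catches a verbatim copy, but a few synonym swaps are enough to push the overlap under the threshold, which is the objection raised above.

```python
import re

def shingles(text: str, n: int = 5) -> set:
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output: str, protected_index: set, n: int = 5) -> float:
    """Fraction of the output's n-grams that also appear in the protected index."""
    out = shingles(output, n)
    if not out:
        return 0.0
    return len(out & protected_index) / len(out)

# Hypothetical "protected" corpus: a single public-domain sentence as a stand-in.
PROTECTED = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness"
)
INDEX = shingles(PROTECTED)

THRESHOLD = 0.5  # assumed cutoff, chosen only for this example

verbatim = PROTECTED
paraphrase = (
    "It was the finest of times, it was the worst of eras, "
    "it was the age of insight, it was the era of foolishness"
)

verbatim_blocked = overlap_ratio(verbatim, INDEX) >= THRESHOLD
paraphrase_blocked = overlap_ratio(paraphrase, INDEX) >= THRESHOLD
print("verbatim blocked:", verbatim_blocked)
print("paraphrase blocked:", paraphrase_blocked)
```

Each lookup is just a set membership test, but the index for a century of literature would be very large, and lowering the threshold to catch paraphrases would presumably start flagging ordinary common phrases too.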
I think you’re overstating its importance. The internet already makes it possible to order almost any book in existence and have it arrive at your doorstep within a week or so, or often on your ebook reader instantly. And your local library probably participates in an interlibrary loan system that lets you request any book held by any library in the country for free.
LibGen gives you access to a much smaller body of works than either of those. It’s a little more convenient. But the big difference is that it doesn’t compensate the author at all.
And what about the other billions of people on the planet that don't even have a library, let alone a doorstep to receive a first world delivery service.
2. DRM is built in to most purchased ebooks, which means you can’t consume the book on any device. “Illegal” tools exist to circumvent this.
3. Large ebook stores - like other digital stores - essentially lend you a copy of the book. So when they are forced to pull a book, they’ll pull your access too.
Of course, now that the big players have consumed/archived the entire book dump, they can go ahead and kill it to prevent others from doing the same thing.
It is *much* more convenient. When a research path takes me to an article or book, I could buy it, order it, or go to a physical library, and that would take hours or days. I could also open it as a PDF in seconds. If you need to read a chapter from a book, or an article, or skim one to check whether it's worthwhile, 20-30 times to figure something out, then libgen is the difference between finishing in a day or a month.
There are a whole lot of books that are out of print, and if a book went out of print before ebooks were a thing, it probably doesn't have a legal digital edition either.
This. Few people here would remember ebooksclub/gigapedia/smiley/library.nu [1] which predated LibGen by several years. But that online library had a lot of books that are not available nowadays. There were lots of scanned books (djvu) that people uploaded. So much lost knowledge.
Sure, but I have a strong feeling that scans of out-of-print books only constitute a small portion of LibGen’s traffic.
It’s like the idea that most BitTorrent users are just using it to share free software and Creative Commons media. (See the screenshots on every BitTorrent client’s website.) It would definitely be helpful if it were true, but everyone knows it’s just wishful thinking.
Academics are huge users of LibGen for academic books from the entire past century and beyond. It's infinitely more convenient to instantly get a PDF you can highlight, than wait weeks for some interlibrary loan from an institution three states away.
Just because the majority of people might be downloading Harry Potter is irrelevant.
Libraries can burn down (see Library of Alexandria), civilizations end (see various). LibGen makes it possible for an individual to backup a snapshot of cumulative human knowledge, and I think that's commendable.
> LibGen gives you access to a much smaller body of works than either of those.
> Just go to a real library.
The thrill of waiting a week for a book to arrive or navigating the labyrinthine interlibrary loan system is truly a privilege that many can afford. And who needs instant access to knowledge when you can have the pleasure of paying for shipping or commuting to a physical library?
It's also fascinating that you mention compensating authors, as if the current publishing model is a paragon of fairness and equity. I'm sure the authors are just thrilled to receive their meager royalties while the rest of the industry reaps the benefits.
LibGen, on the other hand, is a quaint little website that only offers access to a vast, sprawling library of texts, completely free of charge and accessible to anyone with an internet connection. I'm sure it's totally insignificant compared to the robust and equitable systems you mentioned.
Your suggestion to "just go to a real library" is also a brilliant solution, assuming that everyone has the luxury of living near a well-stocked library, having the time and resources to visit it, and not having any other obligations or responsibilities. I'm sure it's not at all a tone-deaf, out-of-touch recommendation.
Yes, publishers don’t pay authors as much as they deserve, but LibGen pays them literally nothing. Authors tend to love libraries but hate piracy. Why? Because earning something is better than earning nothing.
Have you ever submitted an ILL request? It’s extremely simple. Many library systems even integrate with WorldCat, so submitting a request for any book just takes a few clicks.
I’m mostly speaking about people in the US. Every single county in the entire country has a public library. Almost all of them have ILL.
I think equity is a fair argument for the existence of services like LibGen in many parts of the world, but the reality is that almost everyone using a book piracy site in a first-world country is using it to pirate an in-print book that they just don’t want to go to the trouble of borrowing or buying.
Seeing the high prices they are charged for a digital licence which expires after a fairly small number of loans, I feel it'd be better for my library if I pirate when possible. Save those limited loans for someone who prefers/needs them.