Can I download a book without paying for it, and print copies of it? Stash copies in my bathroom, the gym, my office, my bedroom etc. to basically have a copy on hand to study from whenever I have some free time?
Yes, you're allowed to make personal copies of copyrighted works that you own. IANAL, but my understanding is that if you're using them for yourself, and you're not prevented from doing so by some sort of EULA or DRM, there's nothing in copyright law preventing you from e.g. photocopying a book and keeping a copy at home, as long as you don't distribute it. The test case here has always been CDs: you're allowed to make copies of CDs you legally own and keep one at home and one in your car.
To the best of my knowledge, no individual has ever been sued or prosecuted specifically for downloading books. As long as you're not massively sharing them with others, it's not an issue in practice. Enjoy your reading and learning.
Aaron Swartz, cofounder of Reddit, co-author of the RSS 1.0 specification, and an early contributor to Markdown, was hounded to death by an overzealous prosecutor for downloading articles from JSTOR, with the intent to learn from them. He faced up to a million dollars in fines and up to 35 years in prison.
He and Sam Altman were in the same YC class. OpenAI is doing the same thing at a larger scale, and their technology actually reproduces and distributes copyrighted material. It's shameful that they are making claims that they aren't infringing creators' rights when they have scraped the entire internet.
I'm familiar with Aaron Swartz's case, and that is actually why I phrased it as "books". In any case, while tragic, Swartz wasn't prosecuted for copyright infringement, but rather for wire fraud and computer fraud due to the manner in which he bypassed protections in MIT's network and the JSTOR API. This wouldn't have been an issue if he downloaded the articles from a source that freely shared them, like sci-hub.
It would be incredibly naive to assume that the scraping done for these models did not at any point circumvent protections.
The fundamental contention is that both accessed, saved and distributed material that they didn't have a "right" to access, save, and distribute. One was made a billionaire for it and another was driven to suicide. It's not tragic, it's societal malpractice.
> It's shameful that they are making claims that they aren't infringing creators' rights when they have scraped the entire internet.
Scraping the Internet is generally very different from piracy. You are given a limited right to that data when you access it, and you can make local copies. If further use does something sufficiently non-copying, then creators' rights aren't being infringed.
It was overzealous prosecution of breaking into a closet and wiring up some Ethernet cables to gain access to the materials, not of the downloading with intent.
And apparently the most controversial take in this community is the observation that many people would have done the trial, plea and time, regardless of how overzealous the prosecution was.
35 years is a press release sentence. The way DOJ calculates sentences when they write press releases ignores the alleged facts of the particular case and just uses, for each charge, the theoretical maximum sentence that someone could get for that charge.
To actually get that maximum typically requires things like the person being a repeat offender, drug dealing being involved, people being physically harmed, organized crime or terrorism being involved, a large amount of money being involved, or other things that make it an unusually big and serious crime.
The DOJ knows exactly what they are alleging the defendant did. They could easily look at the various factors that affect sentencing for the charge, see which apply to that case, and come up with a realistic number, but that doesn't sound as impressive in the press release.
Another thing that inflates the numbers in the press releases is that defendants are often charged with several related charges. For many crimes there are groups of related charges that for sentencing get merged. If you are charged with say 3 charges from the same group and convicted on all you are only sentenced for whichever one of them has the longest sentence.
If you've got 3 charges from such a group in the press release the DOJ might just take the completely bogus maximum for each as described above and just add those 3 together.
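To make the arithmetic concrete, here is a minimal sketch of the two ways of counting described above. The charge names, the group label, and the year figures are all made up for illustration; they are not the charges or maximums from any real case, and real sentencing involves many more factors than this.

```python
from collections import defaultdict

# Hypothetical charges: (name, sentencing group, statutory maximum in years).
# All values are invented for illustration only.
charges = [
    ("charge A", "group 1", 10),
    ("charge B", "group 1", 5),
    ("charge C", "group 1", 5),
]


def press_release_total(charges):
    """Add up the statutory maximum of every charge, as the press release does."""
    return sum(max_years for _, _, max_years in charges)


def grouped_ceiling(charges):
    """Within each group, count only the charge with the longest maximum."""
    longest = defaultdict(int)
    for _, group, max_years in charges:
        longest[group] = max(longest[group], max_years)
    return sum(longest.values())


print(press_release_total(charges))  # 20
print(grouped_ceiling(charges))      # 10
```

Even the grouped figure is still only a ceiling; the case-specific factors listed above would usually pull a realistic number well below it.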
Here's a good article on DOJ's ridiculous sentence numbers [1].
Here's a couple of articles from an expert in this area of law that look specifically at what Swartz was charged with and what kind of sentence he was actually looking at [2][3].
Why do you think Swartz was downloading the articles to learn from them? As far as I've seen, no one knows for sure what he was intending.
If he wanted to learn from JSTOR articles he could have downloaded them using the JSTOR account he had through his research fellowship at Harvard. Why go to MIT and use their public JSTOR WiFi access, and then when that was cut off hide a computer in a wiring closet hooked into their ethernet?
I've seen claims that what he wanted to do was meta-research about scientific publishing as a whole, which could explain why he needed to download more than he could with his normal JSTOR account from Harvard, but again, why do that using MIT's public WiFi access? JSTOR has granted more direct access to large amounts of data for such research. Did he talk to them first to try to get access that way?
He might have wanted other people to have access to the knowledge, and for free. In comparison, AI companies want to sell access to the knowledge they got by scraping copyrighted works.
Truly wow. The sucking up to corporations is terrifying. This, when Aaron Swartz was institutionally murdered by the institutions and the state for "copyright infringement". And what he did wasn't even for profit, or even a 0.00001 of the scale of the theft that OpenAI and their ilk have done.
So it's totally OK to rip off and steal and lie through your teeth AND do it all for money, if you're a company.
But if you're a human being, doing it not for profit but for the betterment of your own fellow humans, you deserve to be imprisoned and systematically murdered and driven to suicide.
Thank you for putting my sentiment into words. THIS. It's not power to the people, it's power to the oligarchs. Once you have enough power and, more importantly, wealth, you're welcomed into the fold with open arms. Just like how Spotify built a library of stolen music: as long as wealth was created, there is no problem, because wealth is just money taken from the people and given to the ruling class.
> Internet people say you can, but there's no actual legal argument or case law to support that.
Quite the opposite. The burden of proof is on you to show a single person ever, in history, who has been prosecuted for that.
If nobody in the world has ever been prosecuted for this, then that means it is either legal, or it is something else that is so effectively equivalent to "legal" that there is little point in using a different word.
If you want to take the position that, "uhhhhhhh, there is exactly 0% chance of anyone ever getting in trouble or being prosecuted for this, but I still don't think it's legal, technically!"
Then I guess go ahead. But for those in the real world, those two things are almost equivalent.
> If you want to take the position that, "uhhhhhhh, there is exactly 0% chance of anyone ever getting in trouble or being prosecuted for this, but I still don't think it's legal, technically!"
At home? Without ever sharing it with anyone? I thought making backups of things that you personally own was protected, at least in the US. Could you elaborate on my apparent misunderstanding?
This is a specific exception in Australian copyright law. It allows reproducing works in books, newspapers and periodical publications in a different form for private and domestic use.
It seems reasonably within the bounds described by fair use, but nobody's ever tested that particular constellation of factors in a lawsuit, so there's no precedent - hand copying a book, that is.
17 U.S.C. § 107 is the fair use carveout.
Interestingly, digitizing and copying a book on your own, for your own private use, has also not been brought to court. Major rights holders seem to not want this particular fair use precedent to be established, which it likely would be, and might then invalidate crucial standing for other cases in which certain interpretations of fair use are preferred.
Digitally copying media you own is fair use. I'll die on that hill.
It doesn't grant commercial rights, you can't resell a copy as if it were the original, and so on, and so forth.
There's even a good case to be made for sharing a legally purchased, digitally copied work, even with millions of people, 5 years after the book is first sold: for the vast majority of books, about 99.99% of the copies they're ever going to sell have been sold by then.
By sharing after the ~5 year mark, you're arguably doing marketing for the book, and if we cultivated a culture of direct donation to authors and content creators, it invalidates any of the reasons piracy is made illegal in the first place.
Right now publishers, studios, and platforms have a stranglehold on content markets, and the law serves them almost exclusively. It is exceedingly rare for the law to be invoked in defending or supporting an author or artist directly. It's very common for groups of wealthy lawyers LARPing as protectors of authors and artists to exploit the law and steal money from regular people.
Exclusively digital content should have a 3 year protected period, while physical works should get 5, whether it's text, audio, image, or video.
Once something is outside the protected period, it should be considered fair game for sharing until 20 years have passed, at which point it should enter public domain.
Copyright law serves two purposes - protecting and incentivizing content creators, and serving the interests of the public. A situation where a bunch of lawyers get rich by suing the pants off of regular people over technicalities is a despicable outcome.
> there's no precedent - hand copying a book, that is
Thank you! I had looked this up myself last week, so I knew this. I had long believed, as GP does, that copying anything you own without distribution is either allowed or fair use. I wanted GP to learn as I did.
You're repeating upthread comments. And no, you can't. There's an archival exception for electronic media. If you want to make copies of physical media you either:
1. Can't
Or
2. Rely on fair use to protect you (archival by individuals isn't necessarily fair use)
That's not a one-to-one analogy. The LLM isn't giving you the book, it's giving you information it learned from the book.
The analogous scenario is "Can I read a book and publish a blog post with all the information in that book, in my own words?", and under US copyright law, the answer is: Yes.
> The analogous scenario is "Can I read a book and publish a blog post with all the information in that book, in my own words?"
The analogous scenario is actually "Can I read a book that I obtained illegally and face no consequences for obtaining it illegally?" The answer is "Yes": there are no consequences for reading said book, for individuals or machines.
But individuals can face serious consequences for obtaining it illegally. And corporations are trying to argue those consequences shouldn't apply to them.
Not to diminish the atrocity of what happened to Aaron, but is this a highly abnormal case of prosecutorial overzeal, or is it common for people to be charged and held liable for downloading and/or consuming (without distributing) copyrighted materials (in any form) without obtaining a license?
Asking because I genuinely don't know. I believe all I've ever read about prosecution of "commonplace" copyright violations was either about distributors or tied to the bidirectional nature of peer-to-peer exchange (torrents typically upload to others even as you download = redistribution).
Aaron Swartz downloaded a lot of stuff. Did he publish the stuff too? That would be an infringement. But only downloading the stuff? And never distributing it? Not sure that counts as a violation.
There's no analogy because the scale of it takes it to a whole different level and degree, and for all intents and purposes we tend to care about level and degree.
Me taking over control of the lemonade market in my neighbourhood wouldn't ever be a problem to anyone, a very minor annoyance; instead if I managed to corner the lemonade market of a whole continent it'd be a very different thing.
The better analogy is "can my business use illegally downloaded works to save on buying a license". For example, can you use a pirated copy of Windows in your company? Can you use a pirated copy of a book to compute the weights of a mathematical model?
> Can I download a book without paying for it, and print copies of it?
No, but you can read a book, learn its contents, and then write and publish your own book to teach the information to others. The operation of an AI is rather closer to that than it is to copyright violation.
"Should" there be protections against AI training? Maybe! But copyright law as it stands is woefully inadequate to the task, and IMHO a lot of people aren't really grappling with this. We need a functioning government to write well-considered laws for the benefit of all here. We'll see what we get.
Yes, but the learning isn't constrained by those laws. If I steal a book and read it, I'm guilty of the crime of theft. You can put me in jail, try me before a jury, fine me, and put me in prison according to whatever laws I broke.
Nothing in my sentence constrains my ability to teach someone else the stuff I learned, though! In fact, the First Amendment makes it pretty damn clear that nothing can constrain that freedom.
Also, note that the example is malformed: in almost all these cases, Meta et al. aren't "stealing" anything anyway. They're downloading and reading stuff on the internet that is available for free. If you or I can't be prosecuted for reading a preprint from arXiv.org or whatever, it's a very hard case to make that an AI can.
Again, copyright isn't the tool here. We need better laws.
Sure, but OpenAI (same as Google, and Facebook, and all the others) is illegally copying the book, and they want this to be legal for them.
It's perhaps arguable whether it's OK for an LLM to be trained on freely available but licensed works, such as the Linux source code. There you can get into arguments about learning vs. machine processing, whether the LLM is a derived work, etc.
But it's not arguable that copying a book that you have not even bought to store in your corporate data lake to later use for training is a blatant violation of basic copyright. It's exactly like borrowing a book from a library, photocopying it, and then putting it in your employee-only corporate library.
Downloading a pirated copy and reading it yourself is one thing; running a business based on downloading millions of pirated works is another.
Yes, but this is not the right model. What OpenAI wants is to borrow a book, make a copy of it, and keep using that copy, in training their models. This is fully and simply illegal, under any basic copyright law.
Let's say Windows is downloadable from Microsoft's website. Can you use it for free in your company to save on buying a license? Is it OK to use illegal copies of works in a business?
To the extent that this is how libraries function, yes.
The part of that which doesn't apply is "print copies", at least not complete copies, but libraries often have photocopiers in them for fragments needed for research.
AI models shouldn't do that either, IMO. But unlimited complete copies is the mistake the Internet Archive made, too.
I missed the part where we throw away rational logic skills
Have you never been to a public library and read a book while sitting there without checking it out? Clearly, age is a factor here, and us olds are confused by this lack of understanding of how libraries function. I did my entire term paper without ever checking out books from the library. I just showed up with my stack of blank index cards, then left with the necessary info written on them. Did an entire project on tracking stocks by visiting the library and viewing all of the papers for the days in one sitting rather than being a schmuck and tracking it daily. Took me about an hour in one day. No library card required.
Also, a library card is ridiculously cheap even if you did decide to have one.
> Have you never been to a public library and read a book while sitting there without checking it out?
See my comment here: https://news.ycombinator.com/item?id=43355723. If OpenAI built a robot that physically went into libraries, pulled books off shelves by itself, and read them...that's so cool I wouldn't even be mad.
What about checking out eBooks? If you had an app that checked those out and scanned them at robot speed vs. human speed, that would be the same thing. The idea that reading something that does not belong to you directly means stealing is just weird and very strained.
theGoogs essentially did that with a robot that turned each page and scanned it. That's no different than having the librarian pull material for you so that you don't have to pull the book from the shelf yourself.
There are better arguments to make on why ClosedAI is bad. Reading text it doesn't own isn't one of them. How they acquired the text would be a better thing to critique. There are laws in place for that now, so no new laws need to be enacted.
If I spent every last second of my life in a public library, I couldn't even view a fraction of the information that OpenAI has ingested. The comparison is irrelevant. To make the comparison somehow valid, I'd have to back up my truck to a public library, steal the entire contents, then start selling copies out of my garage.
Look, even I'm not a fan of ClosedAI, but this is ridiculous. ClosedAI isn't giving copies of anything. It is giving you a response it infers based on things it has "read" and/or "learned" by reading content. Does ClosedAI store a copy of the content it scrapes, or does it immediately start tokenizing it or whatever is involved in training? If they store it, that's a lot of data, and we should be able to prove that sites were scraped through the lawsuit discovery process. Are you then also suggesting that ClosedAI will sell you copies of that raw data if you prompted correctly?
I'm in no way justifying anything about GPT/LLM training. I'm just calling out that these comparisons are extremely strained.
Let's say OpenAI developers use an illegal copy of Windows on their laptops to save on buying a license. Is it OK to run a business this way?
Also, I think using copyrighted works for research and publishing a paper is a different thing from using copyrighted works to earn money.
I don't need a card to read in the library, nor to use the photocopiers there, but it's merely one example anyway. (If it wasn't, you'd only need one library, any of the deposit libraries will do: https://en.wikipedia.org/wiki/Legal_deposit).
You also don't need permission, as a human, to read (and learn from) the internet in general. Machines by standard practice require such permission, hence robots.txt, and OpenAI's GPTBot complies with the robots.txt file and the company gives advice to web operators about how to disallow their bot.
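As a rough illustration of how that permission check works, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URL, and the `SomeOtherBot` agent name are made up for the example; `GPTBot` is the user agent token mentioned above. Note that robots.txt is a convention a crawler chooses to honor, not an enforcement mechanism.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/books/chapter-1"  # made-up URL
print(is_allowed(ROBOTS_TXT, "GPTBot", url))        # False
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", url))  # True
```

A well-behaved crawler would run a check like this before each fetch and skip anything disallowed for its own user agent.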
How should AI be treated: more like a search index, or more like a mind that can learn by reading? Not my call. It's a new thing, and laws can be driven by economics or by moral outrage, and in this case those two driving forces are at odds.
How so? I don't have to pay to read most websites. To read most books I have to pay (or a library has to pay and I have to wait to get the book).
> IIRC, Google already did your sidenote
Not quite. They had to chop the spines off books and have humans feed them into scanners. I'm talking about a robot that can walk (or roll) into a library, use arms to take books off the shelves, turn the pages and read them without putting them into a scanner.
owning a copy and learning the information is not the same. you can learn 2+2=4 from a book, but you no longer need that book to get that answer. each year in school, I was issued a book for class, learned from it, returned the book. I did not return the learning.
musicians can read the sheet music and memorize how to play it, and no longer need the music. they still have the information.
But you still need to buy the sheet music first; all the AI labs used pirated materials to learn from.
There are two angles to the lawsuits that are getting confused. The largest one, from the book authors (Sarah Silverman et al.), attacked from the angle that the models could reproduce copyrighted information. This was pretty easily quelled / RLHF'd out (it used to be that if ChatGPT started producing lyrics, a supervisor/censor would just cut off its response early - tried it now and ChatGPT.com is more eloquent: "Sorry, I can't provide the full lyrics to "Strawberry Fields Forever" as they are copyrighted. However, I can summarize the song or discuss its themes, meaning, and history if you're interested!").
But there's also the angle of "why does OpenAI have Sarah Silverman's book on their hard drive if they never paid her for it?" This is the lawsuit against Meta regarding books3 and torrenting. It seems like they're getting away with "we never redistributed/seeded!", but it's unclear to me why this is a defense against copyright infringement.
Not only would the musician have to buy the sheet music first, but if they were going to perform that piece for profit at an event or on an album they'd need a license of some sort.
This whole mess seems to be another case of "if I can dance around the law fast enough, big enough, and with enough grey areas then I can get away with it".
As a student in a school band that debated whether to choose Pirates of the Caribbean vs Phantom of the Opera for our half time show, I remember the cost of the rights to the music was a factor in our decision.
The school and library purchased the materials outright. Again, OpenAI, Meta, et al. never paid to read them, nor borrowed them from an institution that had any right to share.
I'm a bit of an anti intellectual property anarchist myself but it grinds my gears that, given that we do live under the law, it is applied unequally.
if you have evidence that openAI is doing this with books that are not freely available, i'm sure the publishers would absolutely love to hear about it.
We know Meta has done it. These companies have torrented or downloaded books that they did not pay for. Things like The Pile, LibGen, and Anna's Archive were scraped to build these models.
>if you have evidence that openAI is doing this with books that are not freely available, i'm sure the publishers would absolutely love to hear about it.