by foob 1070 days ago
From the recent story about the Sarah Silverman lawsuit [1]:

The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”

IANAL, but this basically sounds like LLaMA was trained on illegally obtained books by Meta's own admission. It's an exciting development that Meta is releasing a commercial-use version of the model, but I wonder if this is going to cause issues down the road. It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).

[1] https://news.ycombinator.com/item?id=36657540


Sometimes I wonder: what if someone in country XYZ downloads the whole of Z-Library/Libgen, all the books ever printed, all the papers ever published, all the newspapers and so on, trains a model on it, and releases the model as open source? There are jurisdictions with lax rules.

And such models will have much better knowledge, answers, etc. than the Western, lawyer-approved models.

Sometimes knowledge needs to be set free I guess.

The production of knowledge needs to be funded as it isn’t “free”. Copyright and licensing is one model that has worked for a long time. It has flaws, but it has produced good things.

At this point, with the quality of current web content and the collapse of journalism as an industry, I think we can say online ads have utterly failed as a replacement income stream.

Unless you want all LLMs to say “I’m sorry, the data I was trained on ends in 2023”, you still need a content funding model. Maybe not copyright, but sure as hell not ads either.

"Copyright and licensing is one model that has worked for a long time. It has flaws, but it has produced good things."

By some definition of "worked". If we define "worked" as "made money for", those it mostly worked for are the middlemen and a minority of writers... a minority that, with the advent of LLMs, is likely to shrink even further.

Not friends with any journalists I’m assuming?
There aren't many of them left.
You state this as a fact, but it's actually much less certain whether it's ever been net-positive.

It was probably intended that way, but the reality is that the power has been with the publisher since the beginning, and they've absolutely been screwing over the authors as well. Only the most successful authors have gotten decent deals.

I don't have an answer to this either though, I just wanted to point out that copyright has arguably never been successful at getting money to the content creators proportional to the value the publisher extracted from the work either.

The only way you’d know is to A/B test with a country with no copyright, and see how their authors get by.

My guess is extremely poorly. Again, the biggest might be fine. Instead of publishers paying fairly little to authors they could just literally take the best books and print them, taking all of the profits…not to mention ebooks.

I’m not an author so I can’t speak to how much publishers make, but I’d assume that if one was way better than the others in how much they distribute to authors, all of the best authors would jump ship. Markets have a way of working things out.

A lot of people want to be authors, and any time that happens - game dev, teachers, musicians, etc. - you’re going to take on a bit of extra hardship compared to other jobs.

> The only way you’d know is to A/B test with a country with no copyright, and see how their authors get by.

According to https://www.spiegel.de/international/zeitgeist/no-copyright-... we already had that A/B experiment.

I'm not saying that it would be better for authors without copyright. That would indeed be hard to ascertain without A/B testing.

My point was that it doesn't improve their lives, and that's much easier to check in isolation just by reading the news about the current writers' strike and how the industry just ignores it until fall, expecting the writers' savings to run out.

Really, copyright just doesn't give the content creators any meaningful power as this right is generally owned by the industry/publisher, not the authors.

The production of knowledge (I assume you're mainly talking about scientific research here) is absolutely not funded by copyright royalties or anything like that.

Journals get their content for free. Actually often they charge the authors for it.

Research is mainly funded by governments and taxes.

Industrial R&D is actually almost 3x larger than government-funded R&D.

https://www.brookings.edu/articles/rd-for-the-public-good-wa....

Yeah fair, but still 0% funded by copyright.

Industrial R&D also tends to be more "research for hire" rather than pure research. A bit closer to consulting.

Anyway my point still stands.

But again, "funding" is merely common and/or one step in the process. It's not always necessary and is definitely never sufficient, and I think when you bring it up, the mental model that people have is of the incorrect scale?

Put differently, we consider -- but don't think a whole lot about -- Wikipedia's "funding," because that's NOT the most important part/innovation of that model.

We should do a better job of answering what is.

>The production of knowledge needs to be funded as it isn’t “free”. Copyright and licensing is one model that has worked for a long time.

Can you give some examples of new knowledge that was copyrighted? Generally copyright is used to protect art, software and textbooks. People who produce new knowledge generally are not paid by copyright. The knowledge is either kept secret or published in a journal from which the author receives no compensation.

Training and copyright is going to be interesting. People can be trained on “illegally obtained” books too, yet you’re probably going to be hard pressed to argue that the work of any employee who downloaded a book or a paper from a “libre library” could be challenged with a fruit-of-the-poisonous-tree argument down the line.
If the company supplied the employee with the “illegally obtained” books, that could be reason to view the situation differently than an employee acting on their own.

Since the company is obtaining + providing these models with 100% of their input data, it could be argued they have some responsibility to verify the legality of their procurement of the data.

it's not deemed illegal yet

it's in a weird place imo, with Japan ruling that anything goes for AI data, other countries are put under pressure to allow the same

i.e.,

you're allowed to scrape the web

you're allowed to take what you scrape and put it in a database

you're allowed to use your database to inform on decisions you might make, or content you might create

but once you put an AI model in the mix, all of a sudden there are problems. despite the fact that making the model is 10000% harder than doing all of the points mentioned above, using someone else's work somehow becomes a problem when it never was before

and if truly free and open source LLMs come into the game, then might the corporate ones become crippled from copyright? that's bad for business

> It's not like Meta can remove these books from the training set without retraining from scratch (or at least the last checkpoint before they were used).

They probably can:

https://github.com/zjunlp/EasyEdit

> I wonder if this is going to cause issues down the road.

There are some popular Stable Diffusion models, being run in small businesses, that I am certain have CSAM in them because they have a particular 4chan model in their merging lineage.

... And yet, it hasn't blown up yet? I have no explanation, but running "illegal" weights seems more sustainable than I would expect.

I’ve been wondering when the landmark moral panic would start against Civit.AI and the coomer crowd. People have no idea just how much porn is being produced by this stuff. One of the top textual inversions right now is an… age slider… (https://civitai.com/models/65214/age-slider) ewww. It’s also extremely well rated and reviewed on there. I’m terrified of the impending backlash, because depending on what happens, the party going on in AI could end.
People have been saying this about underage hand drawn hentai forever, but it's still around.

Not that I am disagreeing with you. What I find particularly disturbing are the paid services for this.

Also, I have seen 2 separate OnlyFans pimps ask for help in a text generation chatroom. Something about automating "private" texting from their "girls."

It’s trivial to use these methods to produce real looking images, or even stuff in the likeness of real people…
Yeah. I did a fine tuned model of my daughter and niece and I definitely have to put in “sexy, naked,” and the like in the negative prompt when using them.

I don’t think society is going to have a hissyfit until some app comes along that makes it super easy for people to train good models locally on people and then generate whatever they want. That day’s coming really soon though.

There are tons of web services for this. They are just obscure and distributed enough to avoid public ire.

The pieces to do local LoRA training are all there, but honestly the tyranny of CUDA is the biggest blocker for the average person.

That is not at all the same thing as removing the books.
> They probably can:

No, actually they probably can’t. There is no verifiable way to remove the data from the model apart from completely removing all instances of information from the training data. The project you linked only describes a selective finetuning approach.

Until you get models with completely disentangled feature spaces such that you know that the influence of a piece of data is completely removed (at the limit this is something like an embedding DB), there is absolutely no way you can claim you’ve removed the data from the model.

At most, these efforts will amount to data laundering where it will be impossible to prove that a piece of data was used to train the model, not provide conclusive proof that it was removed.

Which means we are probably at least 5-10 years away from verifiable action that a court of law will recognize.
This assumes it's possible. I naively assume it's not, at least not without harming the model beyond the content of the book.
They can probably prevent LLaMA from spitting out verbatim quotes from the books well enough to make proof difficult.

... But yeah, fundamentally the only way to throw out the books is to throw out the weights.

that is quite the spicy claim
If we accept the argument that you can train an ML model on data scraped from the internet because the model is sufficiently transformative and thus isn't impacted by the copyright of that data, then how does that change simply because somebody else distributed the data illegally? Either the ML model breaks the copyright chain or it doesn't. Or is the argument that using data that was provided to you in violation of copyright is illegal in general?
How is it different from training on random blogs, or Stack Overflow, or in general "The Internet"?
Really, really bad look for Eleuther if this is true. I did not expect them to do something like this and not even see the issue with it.
Move fast and break the law.
It's far from certain at this stage whether this does break the law.
While this may be true, the reverse is also true, and even if it’s legal, there are other ways to frame this that are worth considering. For example, it could technically be legal, but not in accordance with the spirit of the law, in which case updates to the law are required. The fact that the model is legal would then be an additional problem on top of the gap in the law.

I think my main point here is that “legal” does not imply moral or acceptable to society, and our understanding of the technical legal status is not a prerequisite for exploring those factors, which may be the thing that changes the legal status in response to the major shift in landscape.

Right, but if you have a plausible case that you weren't breaking the law and it was a legal unknown, the most that will happen is "we've decided this is officially illegal, stop doing it."

You risk nothing by assuming things are legal until explicitly illegal.

If you limit the framing of the conversation to that of an amoral corporate entity, sure. But I don’t think there was ever a question that companies can legally do things that are potentially (or unequivocally) distasteful if not outright unethical/immoral.

More interesting is the broader conversation which involves society’s response to a major shift in the information economy, new questions about what role these tools should play, and how laws should evolve accordingly.

The factors surrounding the emergence/unfolding of AI tooling can’t be stripped down to just the corporate interests involved.

Copyright laws should be amended to allow this scenario. If I read a book and write about it in a blog, it is considered review. Why shouldn’t we allow companies to do the same to train their models? Overall it will benefit society more than it hurts some rich authors.
I think it’s a mistake/fallacy to equate the human acquisition of knowledge and resulting synthesis of value with that of large-scale computers ingesting the sum total of written human knowledge and the outcomes that enables.

They are not similar, and I suspect that if they were (i.e. humans could absorb that much information), the information landscape and the market models for exchanging value would look nothing like they do today, and AI wouldn’t be rocking the boat, it’d just be another adherent to the resulting laws.

That's one thing I'm consistently surprised HN fails to draw a distinction on: copyright regimes are fundamentally about copy rate.

You can't take a regime that works decently with human-rate copying and convert it to computer-rate copying, because fundamentally the give-and-take of rights to each side is balanced against feasible limits of reproduction.

Or, to put it another way, if you can copy/synthesize at most 1 book a day, I can extend you a lot more implicit rights... than I can afford to someone who can copy/synthesize every book ever in a day.

I think the difference is you presumably obtained that book legally before writing the review. In this case the book was pirated (the definitely illegal part), and then used for training (the possibly illegal part, but I suspect this would be deemed fair use).

IMO Google and their massive Google Books DB would have a better leg to stand on here if they trained on that dataset, as they owned physical copies of all the books.

I don't think it matters. Your review doesn't become copyright infringement just because you pirated the movie.
>Copyright laws should be amended to allow this scenario. If I read a book and write about it in a blog, it is considered review.

The problem with current AI models is that they memorize stuff. There is the case of the AI memorizing an algorithm perfectly, or reciting quotes from Dune and then getting censored.

Now you, as a paying user of these AI tools, are not writing reviews but probably using them for commercial purposes, and it would not be fair if your proprietary code included code copy-pasted from GPL code.

If these AIs were so clever then IMO you could have them learn, say, Python exactly like a human would: a few books and some exercises on Python, some books on algorithms, some books on HTML or whatever tech. But today they are trained on the full GitHub and you get a mix of stuff. My suggestion would also improve the sorry state of JS in ChatGPT, where it uses super old syntax and still uses outdated patterns like it is coding for IE6. My guess is that this is because it was trained on old or bad code, and this means most of the code from now on will use old syntax and be bad.

“Rich authors”.

Citation needed.

I meant the authors that are suing - if you have the money to sue, you can be considered rich? no?
Going to go with “no, you don’t need to be rich to sue”. Likewise, to be included in a class action you don’t have to pay anything, or even participate in any way, you just get a cut of the settlement.
Couldn’t they just buy the ebook and call it a day? The rich people are the people training LLMs not the authors lol
I doubt it makes a difference whether they purchase the ebook or not. And probably a bunch of them aren't even available as ebooks legitimately, people scan books and upload them to zlibrary etc.
It worked for Uber!
It certainly worked for their founders.
And their customers. Show of hands: who wants to go back to taxis?
Most large datasets are full of copyrighted content. They aren’t unique.
It seems difficult to argue that Meta can copy every ebook in existence to train a model, but then other people cannot copy the resulting model.