Hacker News
by mattcantstop 636 days ago
I am very likely in the minority here, but I think AI SHOULD be trained on everything that is in the public sphere. I'd be disappointed if it wasn't trained on everything they had access to.

If it is trained on private information, then I would have an issue with it.

14 comments

I don't agree because it creates this dilemma for creators: you need to put your work out there to get traction, but if you put your work out there and anything public is fair game, then it will be sampled by a computer and instantly recreated at scale. This might even happen without the operator knowing whose work is being ripped off.

Commercial art producers have always ripped off minor artists. They would do it by keeping it very similar to the original but just different enough to avoid being sued. Despite this, I personally know two artists who have sued major companies who ripped off their work for ads, and both won million-plus settlements. Why would we embrace this now that a computer can do it and there's a level of deniability? I don't understand how this benefits anyone.

> Why would we embrace this now that a computer can do it and there's a level of deniability?

Generally I don't think people are arguing that copyright law should be more lenient to AI than it is to humans. If your work gets ripped off (a substantially similar copy not covered by fair use) you can sue regardless of tools used in its creation.

Question would be whether machine learning, unlike human learning, should be treated as copyright infringement. There are differences and the law does not inherently need to treat them the same, but it could.

As to why it should: I think there's huge benefit across a large range of industries to web-scale pretraining and foundation models, and I'd like it to remain accessible to open-source groups or smaller companies without huge data moats. Realistically I think the alternative would likely just benefit Getty/Universal with near-identical outcomes for most actual artists.

When the very basis of copyright is to promote the "progress of science and useful arts", it seems backwards to use it in a way that would set back advances in language translation, malware/spam/DDoS filtering, defect detection, voice dictation/transcription, medical image segmentation, etc.

> Question would be whether machine learning, unlike human learning, should be treated as copyright infringement.

No, the question is whether those genAI we have around are mass copyrights violation machines or whether they "learn" and build non-violating work.

And honestly, I have seen evidence pointing both ways. But the "copyrights protection" institutions are all quick to decide the point, dismissing any evidence on a philosophical basis.

> No, the question is whether those genAI we have around are mass copyrights violation machines or whether they "learn" and build non-violating work.

I refer to the training process in question, which may or may not be violating copyright, as "machine learning" since that's the common terminology. Question is whether that process is covered by fair use. Whether or not it actually "learn"s is not irrelevant, but I'd say more a philosophical framing than a legal one.

> I refer to the training process in question

Yeah, you go for the red herring.

All of the worthwhile debate is about the real violations. But the public discourse is surely inundated with that exact red herring.

I addressed model output (infringes copyright if substantially similar, as with manually-created works) and the process of training the model (requires collating/processing ephemeral copies, possibly fair use). What do you think the "real violations" are, if not those?

> Generally I don't think people are arguing that copyright law should be more lenient to AI than it is to humans. If your work gets ripped off (a substantially similar copy not covered by fair use) you can sue regardless of tools used in its creation.

With humans, copyright law deals with knowing and intentional infringement more severely than accidental and unintentional infringement.

With an AI, any infringement on the part of the AI end-user is very likely going to be accidental and unintentional rather than knowing and intentional, so the legal system is going to deal with it more leniently, even if actual infringement is proven. The exception would be if you deliberately prompted it to create a modified version of a pre-existing copyrighted work.

With humans, whether infringement is knowing or not, intentional or not, can turn into a massive legal stoush. Whereas, if you say it is AI output, and it appears to actually be AI output, it is going to be much harder for the plaintiff (or prosecution) to convince the court that infringement was knowing and intentional.

> but if you put your work out there and anything public is fair game, then it will be sampled by a computer and instantly recreated at scale.

That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied.

> I personally know two artists who have sued major companies who ripped off their work for ads, and both won million-plus settlements.

Ultimately "AI did it" should never be allowed to be used as an excuse. If a company pays for a marketing guy who rips off someone's work and they can be sued for it, then a company that pays for an AI that rips off someone's work should still be able to be sued for it.

> That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied

Until now, this has been an acceptable tradeoff because there's some friction to theft. Directly cloning the work is easy, but that also means an artist can sue or DMCA. It also means the original artist's work can go more viral, which, despite the short-term downsides, can help their popularity long term.

The important difference is that imitating an artist's style with new work used to take significant time (hours or days). With an LLM, it takes milliseconds, and that model will be able to churn out the likes of your work millions of times per day, forever. That's the difference, and why the dilemma is new.

> Ultimately "AI did it" should never be allowed to be used as an excuse

With the exception of an LLM directly plagiarizing, the only way to prove it didn't is by not allowing it to train on something. LLMs are the sum of everything. We could say the same about humans, sure, we are a model trained on everything we've ever seen too. But humans aren't machines who can recreate stuff in the blink of an eye, with nearly perfect recall, at millions of qps.

"That's just how the internet works" is nonsensical when AI is changing how the internet works.

Just because the tradeoffs of sharing on the internet used to work before AI, doesn't mean those tradeoffs continue to be workable after AI.

It's like having drones follow everyone around and publish realtime telephoto video of them because they have "no expectation of privacy" in public places.

Maybe before surveillance tech existed, there was no expectation of privacy in public places, but now that surveillance tech exists, people naturally expect that high-res video of their every move won't be collected, archived and published even if they are in public.

> Maybe before surveillance tech existed, there was no expectation of privacy in public places, but now that surveillance tech exists, people naturally expect that high-res video of their every move won't be collected, archived and published even if they are in public.

Currently, that'd be an unrealistic expectation. I'd agree that it would be nice if that wasn't the case, but laws need to catch up with technology. Right now, AI doesn't change things too much since a company that publishes something that violates copyright law is still breaking the law. It shouldn't matter if an AI was used to create the infringing copy or not.

I'm all for new laws giving extra rights to people on top of what we already have if needed, but generally copyright law is already far too oppressive so I'd need to consider a specific proposed law and its impacts.

The topic of expectations reminds me of this article

https://spectrum.ieee.org/online-privacy

Yup, this is just a new-age tragedy of the commons. As soon as armies of sheep come to graze, or consume your content, the honeymoon's over.

> With an LLM, it takes milliseconds, and that model will be able to churn out the likes of your work millions of times per day, forever.

AI does cause a lot of problems in terms of scale. The good news is that if AI churns out millions of copies of your copyrighted works you're entitled to compensation for each and every copy. In addition to pushing out copies of copyrighted material, AI is also capable of writing up DMCA notices and legal paperwork.

> With the exception of an LLM directly plagiarizing, the only way to prove it didn't is by not allowing it to train on something. LLMs copy everything and nothing at the same time.

An AI's output should be held to the exact same standard as anyone else's output. If it's close enough to someone else's copyrighted work to be considered infringing then the company using that AI should be liable for copyright infringement the same way they would be if AI had never been involved. AI's ability to produce a large number of infringing works very quickly might even be what causes companies to be more careful about how they use it. Breaking the law at speeds approaching the speed of light isn't a good business model.

Outside of competing profit motives, there is no dilemma. It's that underlying motive, and its root, that will have to undergo a drastic change. The Pandora that AI is is already out of the box and there's no putting it back in; only dealing with the consequences.

> That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied.

You could make the same argument about paper. "That's just how photocopiers work! If you don't want your creations to be endlessly duplicated and sold, don't write them down!" Heck, you could make the same argument about leaving the house. "That's just how guns work! Don't go out in public if you don't want to take the risk of getting shot!"

But it's a bad argument every time. That something is technically possible doesn't make it morally right. It's true that a big point of technology is to increase an individual's power. But I'd say that increased power doesn't diminish our responsibility for our actions. It increases it.

> You could make the same argument about paper. "That's just how photocopiers work! If you don't want your creations to be endlessly duplicated and sold, don't write them down!"

No, the argument would be about photocopies, not paper. "That's just how photocopiers work! Don't put something into a photocopier if you don't want photocopies of it." It isn't possible for anyone to access anything on the internet without making copies of that thing. Copies are literally how the internet works.

Shooting everyone who steps outside isn't how guns work either so that also fails as an analogy.

The internet was specifically designed for the global distribution of copies. If that isn't what you want, don't publish your works there.

> That something is technically possible doesn't make it morally right.

Morality is entirely different from how the internet works, but in practice, I don't see anything immoral about making a copy of something. Morality only becomes an issue when it comes to what someone does with that copy.

> If that isn't what you want, don't publish your works there.

"Women are oppressed in Iran. Well, that's just how Iran is. Just leave it if you don't want to be oppressed"

Oh my. Yea, and whatever is some way, is that way – "it is how it is, deal with it". It's an empty statement. The topic is an ethical and political discussion in light of current technologies. It's a question of whether it should work this way. That's how all moral questions come about – by asking if something should be the way it is. And the current state of technology brings a dilemma that hasn't existed before.

And no, the internet was not designed for that. Quite obviously. Sounds like you haven't heard of private messages.

I'm very surprised this has to be stated.

> "Women are oppressed in Iran. Well, that's just how Iran is. Just leave it if you don't want to be oppressed" Yea, and whatever is some way, is that way – "it is how it is, deal with it". It's an empty statement.

No, because Iran can stop oppressing women and still exist as a functional country. Oppressing women today is "how it is". The internet, on the other hand, is designed to be a system for the distribution of copies. That isn't "how it is", but rather "what it is".

The internet cannot do anything except distribute copies and anything that doesn't distribute copies wouldn't be the internet.

> Sounds like you haven't heard of private messages.

Private messages are also not what is being discussed here. The comment being discussed said: "I don't agree because it creates this dilemma for creators: you need to put your work out there to get traction, but if you put your work out there and anything public is fair game, then it will be sampled by a computer and instantly recreated at scale."

"anything public". For what it's worth though, private messages are still copies.

Yes, if one over-narrowly construes any analogy, it can be quickly dismissed. I suppose that's my fault for putting an analogy on the internet.

We've had copying technologies since people invented the pen. It was such an important activity that there were people who spent their whole lives copying texts.

With the rise of the printing press, copying became a significant societal concern, one so big that America's founders put copyright into the Constitution. [1] The internet did add some new wrinkles, but if anything the surprise is that most of the legal and moral thinking that predates it translated just fine to the internet age. That internet transmission happens to make temporary copies of things changed very little, and certainly not the broad principles.

I understand why Facebook and other people lining their pockets would like to claim that they are entitled to take what they want. But we don't have to believe them.

[1] https://constitution.congress.gov/browse/essay/artI-S8-C8-1/...

I don't think that Facebook should be allowed to violate copyright law, but clearly they have the same rights as you do to copy works made publicly available on the internet.

> You could make the same argument about paper.

Most paper doesn't come with Terms and Conditions that everything you write on it belongs to the paper company. I hate Facebook (with a fiery passion) but people gave them their data in exchange for the groundbreaking and unprecedented ability to make friends with another person (which has never been done before). It sucks, but don't use these "free" systems without understanding the sinister dynamics and incentives behind them.

People make the same arguments about the NSA. "They aren't doing anything bad with the data they're collecting about every US citizen." Well, at some point they will. Stop borrowing against future freedom for a tiny bit of convenience today.

I think you're confusing a legal point (whether a T&C really gives Facebook any particular legal right in court) with the moral question of whether or not people should just roll over for large companies because of language we all, Facebook included, know that nobody ever reads.

Even if FB's T&C made it clear they could do this (something I haven't seen proven), that at best means people would have a hard time suing as individuals. They can still get upset. They can still protest to the regulators and legislators whose job it is to keep these companies in line, and who create the legal context that gives a T&C document practical meaning.

> That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied.

And if someone takes a picture of your artwork, or takes a picture of your person, and posts that to the internet without your consent? Have you given up your rights then?

My answer: Absolutely not.

What AI does is much more like the Old Masters approach of going to a museum and painting a copy of a painting by some master whose technique they wish to learn. This has always been both legal, and encouraged.

Or borrowing a thick stack of books from the library, reading them, and using that knowledge as the basis for fiction. That's a transformative work, and those are fine as well.

My take is that training AI models is a bespoke copyright situation which our laws were never designed to handle, and finding an equitable balance will take new law. But as it stands, it's both legal and encouraged for a human to access a Web site (thereby making a copy) and learn from the contents of that website.

That is, fundamentally, what happens when an LLM is trained on corpus data. The difference in scale becomes a difference in kind, but as I said, our laws at present don't really account for that, because they weren't designed to.

LLMs sometimes plagiarize, which is not ok, but most people, myself included, wouldn't consider the dilemma satisfactorily resolved if improvements in the technology meant that never happened. Outside of that, we're talking about a new kind of transformative work, and those are legal.

> This has always been both legal, and encouraged.

Not always. The copy must be easily identifiable as a copy. An exact reproduction can't have the same dimensions as the original, for example.

Drawing just a person or a detail of the picture, or redoing the picture in a different context or style, is encouraged.

Selling a full scale photo of the picture is forbidden. The copyright of famous art belongs to the museum.

The second example is better than the first, yes. I was thinking about the process more than the fact that painting a study produces a work, and a derived one at that, so more normal copyright considerations apply to the work itself.

> An exact reproduction can't have the same dimensions as the original

This is a rule, not a law, and a traditional and widespread one. Museums don't want to be involved in someone selling a forgery, so that rule is a way of making it unlikely. But the difference between "if you do this a museum will kick you out" and "this is illegal" is fairly sharp.

> The copyright of famous art belongs to the museum.

Not in a great number of cases it doesn't, most famous art is long out of copyright and belongs to the public domain. Museums will have copyright on photos of those works, and have been known to fraudulently claim that photos taken by others owe a license fee to the museum, but in the US at least this isn't true. https://www.huffpost.com/entry/museum-paintings-copyright_b_...

Nice scapegoating Anthropomorphized.

Correct analogy is like someone taking pictures of the paintings, going home and applying a photoshop filter, erasing the original signature and adding theirs.

The law already covers that very much so.

If someone takes a picture of me while I'm in public that picture is their copyrighted work and they have every right to post that on the internet. There is no expectation of privacy in public, and Americans have very few rights against other people using photos/video of them (there are some exceptions for things like making someone into your company's spokesperson against their will)

If someone took a photo of my copyrighted work, their photo becomes their copyrighted work. They also have a right to post that picture on the internet without my consent. Every single person who takes a picture of a painting in a museum and posts it to social media is not a criminal. There are legal limitations there too however and that's fine because we have an entire legal system created to deal with that which didn't go away when AI was created.

If a company uses AI to create something that under the law violates your copyright you can still sue them.

> That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied.

This is true for average people. Is it true for the wealthy? Is it true for Disney? Does our law acknowledge this truth and ensure equal justice for all?

It's 100% true for everyone. You can't access anything at disney.com without making a copy of that thing. Disney can't access anything at yourdomain.whatever without making a copy of that thing.

Whatever crimes either of you can get away with using your copies is another matter entirely. Any rights you had under the legal system you had before AI haven't gone away, neither have the disadvantages you have against the wealthy.

One of the comments you replied to was complaining that their work would be copied and used in training LLMs or other lucrative algorithms, and then you responded talking about how it's common to temporarily copy data into RAM to show a web page. Those are very different, and bringing up such technical minutiae is not helpful to the discussion.

If someone asks "how can I share my work online without it being copied?", "actually, you can't share it without people copying it into RAM" is not the answer they're looking for. That answer is too technical, too focused on minutiae, and our laws recognize that.

The point is that "copies" was never the problem. "sampled by a computer and instantly recreated at scale" is the expected outcome of publishing something publicly on the internet.

Their problem was copyright infringement and like you said, our laws recognize that problem. We have an entire legal framework for dealing with companies that publish infringing copies of copyrighted works. None of that has changed with LLMs.

If a company publishes something that violates copyright law they can be sued for it, it shouldn't matter if an AI was involved in the creation of what was published or not.

> That's just how the internet works. Don't put something on the internet if you don't want it to be globally distributed and copied.

Or we could be ethical and encourage others to be ethical.

I see you're one of the ones that wouldn't download a car.

I would share a car I had rights to, and download a car made free to me. Facebook would certainly sue me if it were their car; they should thus be held to that standard, in my personal opinion.

We could make a distinction between individuals and companies doing it.

Depends on the risk assessment, but I'd say I'm a lot more like Robin Hood. Facebook is obviously Prince John.

Okay, but that doesn't change how the Internet works.

Encouraging people to be ethical isn't actually a real way to prevent people copying photos you put up online.

We can encourage profit-driven megacorps to be ethical? Sure, by abolishing them. Otherwise, you're just screaming into the void.

I think what I said is a prerequisite for that. There will be no structural changes without widespread cultural changes.

> then it will be sampled by a computer and instantly recreated at scale.

You don't need to train the AI on the work for that. You don't even need to show the AI the work itself. You can just give the AI a vague description of the work and it is able to replicate something very close to it.

That's something you can try today: hand Claude or ChatGPT an image, let it describe the image, and put that description into your favorite image generator. The output is a clean-room clone of the original. It won't be a photocopy, but it will contain all the significant features that made up the original, even with surprisingly short descriptions of just 100 words.

Won't be long and you can hand the AI a movie trailer and the AI will build you the rest of the movie from it.

> Why would we embrace this now that a computer can do it

You can't stop it long term. That old "How To Draw an Owl" meme is reality now. You give the AI some key points and it will fill in all the rest. The issue here isn't so much copyright, but that we'll be so flooded with content that it will be impossible for anybody to stand out. We might be heading towards the death of static content and into a world where everything is generated on the fly.

Well, just as another perspective...

I'm not convinced that the philosophy of copyright is a net positive for society. From a certain perspective, all art is theft, and all creativity builds upon preexisting social influences. That's how genres develop, periods, styles... and yes, blatant ripoffs and copycats too.

If the underlying goal is to be able to feed creators, maybe society needs better funding models...? The current one isn't great anyway, with 99% of artists starving and 1% of them becoming billionaires.

I'd much prefer something more like the model we have for some open-source projects, where an employer (or other sponsors) pays the living wage for the creator, but the resulting work is then reusable by all. Many works of the federal government are similarly funded, where a government employee is paid by your taxes but their resulting work automatically goes into the public domain without copyright.

I don't buy the argument that nobody would make things if they weren't copyrightable/paid directly. Wikipedia, OSM, etc. are all living proof that many people will volunteer their time to produce creative things without any hope of ever getting paid. As a frequent contributor to those and also open-source code, Creative Commons photography, etc., a large part of the joy for me is seeing how my work gets reused, transformed, and sometimes stolen by others (credit is always nice, but even when they don't mention me, at least I know the work I'm doing is useful to people).

But the difference for me is that I don't rely on those works to put food on the table. I have a day job and can afford to produce those works in my spare time.

I wish all would-be creators would have such a luxury, either via an employer relationship or perhaps art grants and the such. I wonder how other societies handle this... back in the day, I guess there were rich patrons, while some communities sponsor their artists for communal benefit. Not sure what works best, but copyright doesn't have to be the only way society could see creative outputs.

> I'm not convinced that the philosophy of copyright is a net positive for society.

I'm ok with that. But the philosophy of copyrights is not under debate here. All that is being debated is if it should protect small people from big corporations too.

It's not? I thought we were talking about "AI SHOULD be trained on everything that is in the public sphere" and "[your work] will be sampled by a computer and instantly recreated at scale. [...] Commercial art producers have always ripped off minor artists". Isn't that all about copyright and the ability to make money off your creative works?

When I put something on Wikipedia or any other commons, I don't worry about which other person, algorithm, corporation, or AI ends up reusing it.

But if my ability to eat tomorrow depended on that, then I would very much care. Hence, copyright seems an integral part of people's ability to contribute creatively.

My argument is that by detaching their income from the reusability of their work, we would be able to free more creators from that constraint. Under such a system, the little guy would never get rich off their work, but they wouldn't starve when a big corporation (or anyone else) rips them off either.

> I'm not convinced that the philosophy of copyright is a net positive for society.

It absolutely is, just not in its current overpowered form.

> where an employer (or other sponsors) pays the living wage for the creator, but the resulting work is then reusable by all.

Some creators want control over their own narrative, and that's entirely reasonable, at least for a limited time.

> I don't buy the argument that nobody would make things if they weren't copyrightable/paid directly.

That was never the argument as far as I'm aware. There are other concerns, like a creator losing all control of their creation before they had a chance to even finish what they wanted to do/tell.

This actually benefits everyone.

If our combined creative work until this point is what turns out to be necessary to kick-start a great shot at abundance (and if you do not believe that, if it's all for nothing, why care at all about the money wasted on models?) it might simply be our societal moral obligation to endorse it -- just as it will be the model creators' moral obligation to uphold their end of this deal.

Interestingly, Andrej Karpathy recently described the data we are debating as more or less undesirable to build a better LLM and accidentally good enough to have made it work so far (https://youtu.be/hM_h0UA7upI?t=1045). We'll see about that.

I want to see any indication that abundance from AI would benefit mankind first.

While I would love Star Trek, society has been heading very much towards a Cyberpunk aesthetic, aka "the rich hold all the power".

To be precise, AI models fundamentally need content to survive, but they need so much content that there is no price that makes sense.

Allowing AI to monetize without enriching the people who allowed it to exist isn't a good path forward.

And to be clear I do not believe there is a fundamental rift here. Shorten copyright to something reasonable like 20 years and in a decade AI will have access to all of the data it needs guilt free.

There are glimpses. Getting a high score on an Olympiad means there is the possibility of being able to autonomously solve very difficult problems in the future.

> I don't agree because it creates this dilemma for creators: you need to put your work out there to get traction, but if you put your work out there and anything public is fair game, then it will be sampled by a computer and instantly recreated at scale. This might even happen without the operator knowing whose work is being ripped off.

This is no different from the current day: copying already happens (as your friends have seen). AI makes it a little easier, but the same legal frameworks cover this. I don’t see why AI stealing is any different than a person doing the same thing. The ability to copy with zero cost was incredibly disruptive and incredibly beneficial to society. Settled case law will catch up and hopefully arrive at the same conclusion it has for human copyright infringement (is it close enough to warrant a case).

Where the puck is about to be is very different from where it is. Generative AI hasn't cracked the creativity problem yet. It can generate new art but it can't develop its own style like a human can (from first principles, humans basically caricature high quality video feed).

There is pretty good reason to believe that this will be a solved problem inside a decade. We're moving towards processing video and the power put behind model training keeps increasing. How much is a style worth when computers can just generate 1,000 of them at high speed? It is going to be cheap enough that legal protection is almost irrelevant; ripping off a style will probably be harder than just creating a new original one.

We can wait a bit to find out where the equilibrium is before worrying about what the law should be.

I'm not convinced machines can come up with styles like humans can. After all, a style will be judged by humans. How humans respond cannot be determined from previous styles.
There is either copyright violation or there isn't. Like you said, artists can still sue companies for copying their work, AI or not. If the work was transformative enough then, well, what's the problem?
It doesn’t recreate anything outside of edge cases you really have to go looking for. It will ingest and spit out the style though and I see nothing wrong with that. It’s basically what people do right now.
Who cares? Don't we want the most absolute intelligence to help human civilization? Credits and creators are below that.
Creators may disagree.
AI isn't ripping off anyone's work. Certainly if it is, it's doing so to a much lesser extent than commissioning an artist to do a piece in another artist's style is.
Information wants to be free, man.
My wife works in a studio with a gaggle of artists who all blatantly "rip each other off" constantly.
AFAIK, AI models have no way of differentiating high quality input from garbage. If it's fed peer-reviewed, academic papers as well as a paranoid, violent person's Facebook manifesto it treats them with equal weight as long as the sentences are coherent.
On some level it needs to be fed some amount of garbage because it takes in all sorts of garbage inputs like we do.

AI that needs painstakingly curated training data isn't interesting in the same way that early lightbulbs that used precious metals and cost too much to be commercially viable aren't interesting.

When I speak to my friends, it's a conversation not wholly private - after all, I've shared whatever I'm saying with them - but it certainly isn't wholly public.

In all our conversations, we have and we understand there are degrees of privacy; that which we share with family, that with friends, that with strangers.

When I post on-line, I both expect and expected that my conversations would be between me and the group of people I conversed with. I knew who was reading, and I was fine to write whatever I was writing to that group.

I may be wrong, but I think this is generally how people feel, how they act, what they expect, how they are, as humans. We think about who we are writing to. It does not come naturally to imagine that third parties are listening in, or will listen in, in the decades to come.

This brings us to now, with a third party, reaching back over ten or fifteen years, for absolutely everyone, everywhere, taking copies of everything it can get access to, for its own use, whatever that may be.

I profoundly reject Microsoft, and Google, and all entities and companies which act in such ways, these smiling evils, with their friendly icons and bright colours, happy faces and hundred page T&Cs to utterly obscure and obliterate the truth of their actions.

That sounds compelling when you borrow the marketing term "AI" and position the work as part of a sweeping revolution into some beautiful sci-fi future.

It's less compelling when you see the technology as noisy content generators that will flood the network with spam and devour the livelihood and opportunity to learn for low-market artists and programmers.

In the former perspective, you may look at this as "well, what's the best way we can make this happen?" while the latter sees it more like "So you insist on making this happen. Are you sure there's a suitably responsible way for you to do that?"

Why?

A statement that extraordinary would be interesting if it had some reasoning alongside it.

Also, Facebook posts aren't really "in the public sphere" / publicly accessible, but that's a nitpick.

This is just copyright infringement reworded to pretend it's not. I own the things I write, and publishing it on the internet doesn't negate that. OpenAI doesn't have the right to claim it, no matter what they think, and neither does anyone else.
Firstly publishing something on Facebook explicitly gives them the right to "copy" it. It certainly gives them the right to exploit it (it's literally their business model.)

Secondly, Facebook is behind a login, so it's not "public" in the way HN comments are public. You'd have gained more kudos had you argued that point.

Thirdly, this article is about MetaAI, not OpenAI. So, no, OpenAI isn't claiming anything about your Facebook post.

I'll assume however that you digressed from the main topic, and were complaining about OpenAI scraping the web.

Here's the thing. When you publish something publicly (on the internet or on paper) you can't control who reads it. You can't control what they learn from it, or how they'll use that knowledge in their own life or work.

You can of course control republishing of the original work, but that's a very narrow use case.

In school we read setwork books. We wrote essays, summaries, objections, theme analysis and so on. Some of my class went on to be writers, influenced by those works and that study.

In the same way OpenAI is reading voraciously. It is using that to assign mathematical probabilities to certain word pairings. It is studying published material in the same way I did at school, albeit with more enthusiasm, diligence and success.

In truth you don't "own the things you write", not in the conceptual sense. You cannot own a concept, argument or position. Ultimately there is nothing new under the sun (see what I did there?) and your blog post is already a rehash of that which came before.

Yes, you "own" the text, to the degree to which any text can be "owned" (which is not much.)

>Firstly publishing something on Facebook explicitly gives them the right to "copy" it. It certainly gives them the right to exploit it (it's literally their business model.)

This isn't necessarily true for a user content host. I haven't read Facebook's TOS, but some agreements restrict what the host can do with the users' content. Usually things like save content on servers, distribute it over the web in HTML pages to other users, and make copies for backups. This might encourage users to post poetry, comics, or stories without worrying about Facebook or Twitter selling their work in anthologies and keeping all the money.

>In school we read setwork books. We wrote essays, summaries, objections, theme analysis and so on. Some of my class went on to be writers, influenced by those works and that study.

Scholarly reports are explicitly covered under a Fair Use exception.

https://www.copyright.gov/help/faq/faq-fairuse.html

But also be careful not to anthropomorphize LLMs. Just because something produces content similar to what a human would make doesn't mean it should be treated as human in the law. Or any other way.

OpenAI is not reading voraciously, it is not a human being. It makes copies of the data for training.

If there were an actual AI system which was trained by continuously processing direct fetches from the Web, without storing them but directly using them for internal state transitions, then that might make the reading analogy work. But then AI engineers couldn't do all the analysis and annotation steps that are vital to the training process.

Beautifully written. Thanks.
> publishing it on the internet doesn't negate that

The terms of use of most sites (including this one) include giving the site owners a license to use what you post, often in any way they see fit.

We need to distinguish between modalities of machine intelligence and proceed to set policy tailored to each specific type. (Arguably) we can further include the variable of private, public control; and the orthogonal matter of private or public service.

Machine intelligence is anthropomorphic in utility, that is it serves as either surrogate or substitute for a human cognitive capability. This permits enumeration of AI utility categories. Broadly we can distinguish between creative, knowledgeable, analytical, judicial, predictive, and directing.

As an example use of this approach, consider the case of the AI trained on all public domain material and optionally having had training access to private matter (think Vatican archives). Such an instance should generally not be afforded creative rights, but we would be remiss to restrict its utility as a knowledge base.

The other parameters noted in terms of dual of public|private can of course have bearing on setting type specific constraints.

Are you concerned that this approach will lead to the abandonment of an open web?

If AI companies, and whatever comes next, are expected to take advantage of everything shared online, regardless of copyrights, it seems reasonable that people will stop sharing most things of value.

If you never display your work online, you’ll probably never gain any traction as an artist.
I'd be concerned with getting traction if my art is online and anyone can feasibly copy my style.

Maybe that's unimportant and no different than being able to make physical copies, though good forgeries haven't always been so easily done and the forgery is meant to be an identical copy of the original work. It could just be me, but the idea of an attempted identical copy of a well known work feels different than a new creation being passed off as the work of a well known artist. For example, you can claim to have a really good copy of the Mona Lisa but that wouldn't be as valuable as claiming you have a previously unknown, unique work from the artist.

The question becomes: are posts with a limited reach (friends) part of the public sphere?
Yet, you signed an EULA for every publicly available service that legally prevents you from doing anything the company doesn't want you to do. So why do you want them to be legally using your data without restriction, while they explicitly deny you things like scraping data from their platform?
Is there anything in the "public sphere" that is not (a) published to the web and (b) under a license that allows Meta to use it for training "AI"?

It seems that "AI" is biased toward (1) only bits, and (2) only bits that are published to the internet.

I think the question is what is included in the "public sphere".

If I'm a Facebook user, I definitely don't see posts that I meant to share among friends as something that should be considered part of the public sphere.

You're probably among friends on this site, but outside tech coded spaces, most people understand that publicly available is not the same thing as an unlimited license to do whatever you want.
While true, it's likely more pertinent that most people don't have a clue what's possible legally or technically until it gets in the news.

Can't give informed consent if you don't know what the EULA means or what the machines can do.

This site is weird when you compare this to open source software projects. When these same companies sell those as a service on their platforms, it is again a huge, massive problem and exploitation... even when the license explicitly allows that, without a single legal question.

I wonder if things would be different if software by the mega corps could be copied and then recreated by these models. Would there still be such a push in favour of it?

The people arguing in favor of AI use are not in fact arguing for "an unlimited license to do whatever you want" so that solves that apparent hypocrisy nice and quick.

Also your theoretical software cloner would also make clones of proprietary software, right? I think that would be welcomed just fine.

They're apparently arguing for the legal right to use all content on the internet to create a product that is commercial and competes with the original content.
As long as it's only borrowing very small amounts from any particular source work, I think it's fine for a new work to be commercial and compete with the originals.
They ingested the whole work.
"If they can, they will."