by rebelde 1188 days ago
Why write your thoughts on the web when AI/GPT is only going to steal and paraphrase them? Nobody sees what you write and everybody thinks GPT is the genius.
9 comments

Just saw something today where the wife of TotalBiscuit, who died of cancer several years ago, is contemplating deleting all of his YouTube videos[1] to prevent people from using A.I. to make him say terrible things.

It did give me a bit of pause about putting stuff out there. Although I think I'd still rather have my data be used for training A.I. than not (and I'm probably already in the training data anyway; I believe I saw that one of the datasets it's been trained on was Hacker News comments).

[1]: https://kotaku.com/totalbiscuit-john-bain-youtube-delete-vid...

Given that the "AI" community apparently treats intellectual property rights with wanton abandon, I can't say such a response would be unwarranted.

Dire circumstances call for drastic measures, as they say.

Quite a sad, but completely understandable reaction. The saddest part is probably that it's already too late to prevent people from generating TB deepfakes and other content. Cloning a voice takes half an hour of clips now; any downloaded live stream should be enough already.

It's sad to see AI on a path to destroy years of collected internet content. I expect the Internet Archive to receive loads of takedown requests in the coming months and years because of this.

I would like to make the opposite argument. All these days I didn't share my thoughts because everyone else was, and my voice would be drowned in a sea of voices. In the post-GPT-4 era it's easier to stand out if your thoughts are actually original and refreshing, because most people sound like their thoughts have been written by GPT.

To rephrase it another way, the reign of the conformist ends here and the reign of the contrarian begins now.

A lovely sentiment in theory, but Waldo is still perniciously difficult to find even though he dresses differently from every other character.
What if all characters other than Waldo were just dressing the same because they were trying to ape each other to get fictitious points on social forums? The internet has trained an entire generation to make arguments to get validation on social media, and that definitely shows in the ideas that are put forward.
Or just the reign of brevity. Sheer volume is no longer impressive.
Great point. More volume in explaining the same thought is more GPT-like.
Your ideas are low probability autocomplete. GPT wants popular ideas, not novel ideas.
I was trying to say that what most people say is mostly unoriginal and is very reminiscent of GPT style writing. What data GPT trains on or pays attention to is another question.
That's why I keep my content as low quality as possible - keeps the machines humble.
I'll just run it through an AI upscaler before I run it through the AI language model.
We don't need an upscaler, we need an upclasser so all the ASCII Dickbutts drawn get little top hats and monocles put on them.
The general problem of "AI"s being trained on copyrighted content needs to be discussed more thoroughly, I think.
Every time I bring this up, people accuse me of resisting progress, "the cat's out of the bag", etc.

It has been frustrating.

The cat is out of the bag, and I don't see any reason training should be any more controlled than me personally viewing something and 'training' my brain on it. Using either to duplicate copyrighted works is already clearly illegal.
It is illegal for you to download copyrighted material and distribute it as your own. Models trained on such data can (and are statistically more likely to) produce output similar to their training input.

So training must consider licensing where copyrighted material is used, rather than consuming all data.

Your brain is not a model. You cannot reproduce most of what you see. You're not "training" your brain by glancing at an image, as your recall of that image will be terrible.

My brain can certainly recreate something it’s seen before. And it can certainly create something similar to a thing it’s seen before. It’s legal to do the latter and illegal to do the former. Imperfections in the exact recreations don’t affect the legality of it.

Am I violating copyright law because I am merely capable of producing a copy of something? Obviously not. Why should the model be?

>It is illegal for you to download copyrighted material and distribute it as your own

I'm sure the millions of people who violate copyright law daily with absolutely no repercussions care very much about that.

Millions of people don't pay taxes and cross the road in the wrong place.

You can't set up a cinema and charge for tickets to movies you stole.

It's the money-making side that matters, not individuals in a private house.

There will just be checks that make sure that the generated content is not similar enough to violate copyrights of training material and that's it.
For the same reason that the police being able to have a person look up in a physical printed file who owns a particular car via its license plate is not the same as having a network of cameras and computers that track every car in the city.
Yeah, I don't have any problem with that either. If a cop has a right to see me, he should be legally allowed to record me (and in fact I would prefer all cop interactions were recorded). A camera + AI allows for massive cost savings on basic police work, enabling police to be more efficient. A camera has a lot less bias than a cop.
It's because you (and all of us) have a teeny human brain, and these are terrible at remembering things, so the teeny little bits you can remember are protected under Fair Use.
I think it’s not very hard; if the AI companies believe the data they trained on is public domain/open because they scraped it off the internet, then their trained weights must be publicly available as well. They cannot claim ‘but training is expensive’; if they do, then they should pay fees for the hosting and storage and writing time of all data they scraped. I prefer open weights as it’s more practical. Your weights have a sliver of GPL source in them? Well, that infected the entire thing as GPL does: it is ours now too!
The current (legal) answer is "unclear". There are indications that training is fine, but producing and using the generated content is questionable at least. As with many IP issues, it will be settled only when someone takes it to court and goes all the way to a verdict. Some cases are actually being processed, but it might take years to get an answer.
> The general problem of "AI"s being trained on copyrighted content

> The current (legal) answer is "unclear".

The European Union was ahead of its time for once. The 2019 copyright directive, article 4, makes it legal to scrape the web and make and keep local copies of copyrighted works for data mining purposes, unless the copyright holders set up a machine-readable exception (such as a robots.txt file).

So legal in EU, "unclear" in US.
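
For what it's worth, here is a minimal sketch of how a scraper might honor that kind of machine-readable opt-out, assuming robots.txt really is the mechanism in play. It uses Python's standard `urllib.robotparser`; the crawler name `ExampleAIBot`, the sample rules, and the `may_scrape` helper are all made up for illustration and say nothing about how any real AI crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to opt out of one crawler.
# "ExampleAIBot" is an invented user agent, not a real crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_scrape(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines (no network access).
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/my-original-thoughts"
    print(may_scrape(ROBOTS_TXT, "ExampleAIBot", page))
    print(may_scrape(ROBOTS_TXT, "SomeOtherBot", page))
```

Whether a check like this is enough to count as an opt-out under the directive is a legal question the code can't answer; it only shows that the signal is cheap to read if a crawler chooses to look.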

That does not, to me, automatically imply that an "AI" lawfully regurgitating copyrighted content is a "data mining purpose".
Consider that an AI may combine many snippets of copyrighted publications into a chimera of 'facts'.

'copyright fair use' : https://copyrightalliance.org/faqs/what-is-fair-use/

Does OpenAI respect robots.txt? Do we know?
Copyright's been dead since the internet was born. I really do think it's the least of our problems when it comes to abstract reasoning engines.
Becoming part of the cultural lexicon is the ultimate goal of thought leadership.

Just look at how many people say stuff like “Two women can’t make a baby in 4.5 months”. Someone (Brooks) had to invent, write down, and popularize that analogy.

Why write your thoughts on the web when other humans are going to steal and paraphrase them? I mean... you're on HN. Don't tell me you didn't notice people often regurgitate the thoughts of tech influencers like Paul Graham and Joel Spolsky.
Anonymous people regurgitate the thoughts of well-known individuals such as Paul Graham and Joel Spolsky. The fact that their thoughts are regurgitated is a testament to how well known they are already and how much their content is read by other people. Nobody is going to steal their limelight only on the basis of paraphrasing their ideas. However, if someone does write original ideas of their own, they may gain some notoriety for themselves.

Now imagine that Paul Graham and Joel Spolsky were able to read everything being written by every anonymous unknown on the internet, and create content paraphrasing any and every original thought that was created by anonymous individuals at will. How do the original creators of these thoughts have any chance to succeed on their own merit, if Paul Graham and Joel Spolsky (who everyone knows already as sources of ideas) are able to write the same stuff as soon as the anonymous person has made it public?

If Paul Graham is expressing every conceivable thought then he’s not a very interesting person to read because he has no perspective on anything.

But if a model starts generating better content than Paul Graham in a nice curated form, then yeah, Paul Graham ought to find a better way to spend his time because he is not adding value.

Imagine a friend asks for help in a class. You can either spend some time and try to teach them the subject or let them copy off you during the exam. The former generally feels good despite taking more effort. The latter often feels bad even if it doesn't impact you negatively in any way and helps your classmate more than if you did nothing.

The human to human connection that a blog or social media conversation creates feels a lot more like teaching your classmate while the AI feels a lot more like someone cheating off your work. Plus the AI didn't even bother to get your approval before copying from you. The whole thing feels ethically compromised regardless of the ultimate result.

This was the place I reached. I'm not concerned about "stealing", exactly, but I don't want to contribute to this technology.

I think my days of sharing things freely on the web are over.

So maybe only post dumb and incorrect information.

Train it to be wrong on purpose, for a joke.

Because you can get points on Hacker News.