Hacker News new | ask | show | jobs
by vbo 753 days ago
I don't want to defend Altman. He may or may not be a good actor. But as an engineer, I love the idea of building something magical, yet lately that's not straightforward tinkering - unless you force your way - because people raise all sorts of concerns that they wouldn't have 30 years ago. Google (search) was built on similar data harvesting and we all loved it in the early days, because it was immensely useful. So is ChatGPT, but people are far more vocal nowadays about how what it's doing is wrong from various angles. And all their concerns are valid. But if openai had started out by seeking permission to train on any and every piece of content out there (like this comment, for example) they wouldn't have been able to create something as good (and bad) as ChatGPT. In the early search days, this was settled (for a while) via robots.txt, which for all intents and purposes openai should be adhering to anyway.

But it's more nuanced for LLMs, because LLMs create derivative content, and we're going to have to decide how we think about and regulate what is essentially a new domain and method and angle on existing legislation. Until that happens, there will be friction, and given we live in these particular times, people will be outraged.

That said, using SJ's voice given she explicitly refused is unacceptable. It gets interesting if there really is a voice actor that sounds just like her, but now that openai ceased using that voice, the chances of seeing that play out in court are slimmer.

5 comments

Google search linked to your content on your site. It didn't steal your content, it helped people find it.

ChatGPT does not help people find your content on your site. It takes your content and plays it back to people who might have been interested in your site, keeping them on its site. This is the opposite of search, the opposite of helping.

And robots.txt is a way of allowing/disallowing search indexing, not stealing all the content from the site. I agree that something like robots.txt would be useful, but consenting to search indexing is a long, long way from consenting to AI plagiarism.

Point is we couldn't have a way of consenting to ai training until after we had llms. And I'm guessing we will, pretty quickly.
>Point is we couldn't have a way of consenting to ai training until after we had llms.

Sure we could have. even if we're talking web 1.0 age, Congress passed a law as early as the early 00's for email, which is why every newsletter has to have a working unsubsribe link. so it's not impossible to do so.

regardless, consent is a concept older than the internet. Have an option of "can we use your data to X?" and a person says yes/no. It's that simple. we can talk about how much we cared 30 years ago, but to be frank that mistrust is all on these tech companies. They abused the "ask for forgiveness" mentality and proceeded to make the internet nearly impossible to browse without an ad blocker. Of course people won't trust the scorpoion one more time.

As an conrast, look at Steam. It pretty much does the same stuff on the inside, but it reinvests all that data back to the platorm to benefit users. So people mind much less (and will even go to war for) having a walled garden of their game library. Short sighted, but I understand where it comes from.

Licenses allowing derivative works without attribution preceded LLMs.
> But if openai had started out by seeking permission to train on any and every piece of content out there...

But why would anyone seek permission to use public data? Unless you've got Terms and Conditions on reading your website or you gatekeep it to registered users, it's public information, isn't it? Isn't public information what makes the web great? I just don't understand why people are upset about public data being used by AI (or literally anything else. Like open source, you can't choose who can use the information you're providing).

In the case being discussed here, it's obviously different, they used the voice of a particular person without their consent for profit. That's a totally separate discussion.

>why would anyone seek permission to use public data?

first of all it's not all public data. software licenses should already establish that just because something is on the internet doesn't mean it's free game.

>Unless you've got Terms and Conditions

The new york times did:

https://help.nytimes.com/hc/en-us/articles/115014893428-Term...

Even if you want to bring up an archive of the pre-lawsuit TOS, I'd be surprised if that mostly wasn't the same TOS for decades. OpenAI didn't care.

>Isn't public information what makes the web great?

no. Twitter is "public information" (not really, but I'll go with your informal definition here). If that's what "public information" becomes then maybe we should curate for quality instead of quantity.

Spam is also public information and I don't need to explain how that only makes the internet worse. and honestly, that's what AI will become if left unchecked.

> Like open source, you can't choose who can use the information you're providing

That's literally what software licenses are for. You can't stop people from ignoring your license, but breaking that license opens you wide open for lawsuits.

The right to copy public information to read it does not grant the right to copy public information to feed it into a for-profit system to make a LLM that cannot function without the collective material that you took.
That's the debatable bit, isn't it. I will keep repeating that I really don't see a difference between this and someone reading a bunch of books/articles/blog posts/tech notes/etc etc and becoming a profficient writer themselves, even though they paid exactly 0 money to any of these or even asked for permission. So what's the difference? The fact that AI can do it faster?
> That's the debatable bit, isn't it.

If people used the correct term for it, "lossy compression", then it would be clearer that yeah, definitely there's a line where systems like these are violating copyright and the only questions are:

1. where is the line that lossy compressions is violating copyright?

2. where are systems like chatgpt relative to that line?

I don't know that it's unreasonable to answer (1) with that even an extremely lossy compression can violate copyright. I mean, if I take your high-res 100MB photo, downsample it to something much smaller, losing even 99% of it, distributing that could still violate your copyright.

Again, how is that different than me reading a book then giving you the abridged version of it, perhaps by explaining it orally? Isn't that the same? I also performed a "lossy compression" in my brain to do this.
> is that different than me reading a book then giving you the abridged version of it, perhaps by explaining it orally?

That seems like a bad example, I think you are probably free to even read the book out loud in its entirety to me.

Are you able to record yourself doing that and sell it as an audiobook?

What if you do that, but change one word on each page to a synonym of that word?

10% of words to synonyms?

10% of paragraphs rephrased?

Each chapter just summarized?

The first point that seems easier to agree on isn't really about the specific line, just a recognition that there is a point that such a system crosses where we can all agree that it is copying and that then the interesting thing is just about where the boundaries of the grey area are (i.e. where are the points on that line that we agree that it is and isn't copying, with some grey area between them where we disagree or can't decide).

> how is that different than ...

In one case, you are doing it and society is fine with that because a human being has inherent limitations. In other case, a machine is doing it which has different sets of limitations, which gives it vastly different abilities. That is the fundamental difference.

This also played out in the streetview debate - someone standing in public areas taking pictures of surroundings? No problem! An automated machine being driven around by a megacorp on every single street? Big problem.

I think that must be it.

There's an unstated assumption that some authors of blog posts have: if I make my post sufficiently complex, other humans will be compelled to link to my post and not rip it off by just paraphrasing it or duplicating it when somebody has a question my post can answer.

Now with AIs this assumption no longer holds and people are miffed that their work won't lead to engagement with their material, and the followers, stars, acknowledgement, validation, etc. that comes with that?

Either that or a fundamental misunderstanding of natural vs. legal rights.

human vs. bot is all the difference:

- a human will be an organic visitor that can be advertised to. A bot is useless

- A human can one day be hired for their skills. An AI will always be in control of some other corporate entity.

- volume and speed is a factor. It's the buffet metaphor, "all you can eat" only works as long as it's a reasonable amount for a human to eat in a meal. Meanwhile, a bot will in fact "eat it all" and everyone loses.

- Lastly, commercial value applies to humans and bots. Even as a human I cannot simply rehost an article on my own site, especially if I pretend I read it. I might get away with it if it's just some simple blog, but if I'm pointing to patreons and running ads, I'll be in just as much trouble as a bot.

> I really don't see a difference between this and someone reading a bunch of books/articles/blog posts/tech notes/etc etc and becoming a profficient writer themselves

tangential, but I should note that you in fact cannot just apply/implement everything you read. That's the entire reason or the copyright system. Always read the license or try to find a patent before doing anything commercially.

To me it's more like photocopying the contents of a thousand public libraries and then charging people to access to your private library. AI is different because you're creating a permanent, hard copy of the copyrighted works in your model vs. someone reading a bunch of material and struggling to recall the material.
Quantity has a quality all its own.
You state that as a fact. Is that your opinion, or based on a legal fact?
> Google (search) was built on similar data harvesting and we all loved it in the early days, because it was immensely useful. So is ChatGPT, but people are far more vocal nowadays about how what it's doing is wrong from various angles.

Part of that is that we've seen what Google has become as a result of that data harvesting. If even basic search engines are able to evolve into something as cancerous to the modern web as Google, then what sorts of monstrosities will these LLM-hosting corporations like OpenAI become? People of such a mindset are more vocal now because they believe it was a mistake to have not been as vocal then.

The other part is that Google is (typically) upfront about where its results originate. Most LLMs don't provide links to their source material, and most LLMs are prone to hallucinations and other wild yet confident inaccuracies.

So if you can't trust ChatGPT to respect users, and you can't trust ChatGPT to provide accurate results, then what can you trust ChatGPT to do?

> It gets interesting if there really is a voice actor that sounds just like her, but now that openai ceased using that voice, the chances of seeing that play out in court are slimmer.

It's common to pull things temporarily while lawyers pick through them with fine-toothed combs. While it doesn't sound like SJ's lawyers have shown an intent to sue yet, that seems like a highly probable outcome; if I was in either legal teams' shoes, I'd be pulling lines from SJ's movies and interviews and such and having the Sky model recite them to verify whether or not they're too similar - and OpenAI would be smart to restrict that ability to their own lawyers, even if they're innocent after all.

> Google (search) was built on similar data harvesting and we all loved it in the early days

Google Search linked back to the original source. That was the use case: to find a place to go, and you went there. Way less scummy start than OpenAI.

As an engineer, the current state of LLMs is just uninteresting. They basically made a magical box that may or may not do what you want if you manage to convince it to, and fair chance it'll spout out bullshit. This is like the opposite of engineering.
In my opinion, they're extremely interesting... for about a week. After that, you realise the limitations and good-old-fashioned algorithms and software that has some semblance of reliability start to look quite attractive.