by GaggiX 500 days ago
You can't license 20T tokens. I guess it's hard to grasp how big these datasets are.

"Can't" or "don't want to"?

OpenAI is talking about spending half a trillion US dollars; they have the money to license data.

In music, there is compulsory licensing and companies that use recorded music are able to make the economics work.

It needs to be repeated that these are not simply "tokens"; they are the work of millions of individual people, which is being appropriated for the financial gain of a very few other people.

>"Can't" or "don't want to"?

Can't. Even if someone has the money (which I truly doubt), you can't contact millions of copyright owners (as you report).

Going with the flow here, does that mean if I build a little script that downloads just enough movies, songs and books from the internet I don't have to obey the current law, because it is a) too expensive, and/or b) impractical?

I'm sure you already see the folly of that argument.

Anyhow, flowing on, the allegedly totally inefficient governments of this world routinely contact millions and millions of legal entities, and many of them are poorer than Microsoft, Google, or even OpenAI, yet they somehow manage. So it seems to be practical.

Of course, that does not answer the cost thing, we all know governments just print more fiat money...

So we have been told that IP is indeed property and the property owner has a right to compensation for use. Nobody ever told me that I just have to be blatant enough to get off scot-free. And I guess Sony, Warner Bros., Atlantic et al. didn't get the memo either, or why would they sue a single university student for 4.5 million dollars? [1] That seemed, and was, far too much for a single university student to pay. So "too expensive" is off the table, too. Weird world.

[1] the Tenenbaum case. Tenenbaum was lucky but still broke afterwards.

>I don't have to obey the current law,

There is currently no law that states it is illegal to train a model on copyrighted work.

Well, the plaintiffs in the copyright suits are arguing that copyright law already does. But I can tell this conversation is not going to go far.
If it's impossible to do it legally, then they shouldn't be able to do it. Violating one person's rights is illegal, but violating a billion's rights for profit is fine?

I'm in support of them being able to do it, but the right avenue is by working and lobbying hard to change antiquated copyright laws. Being able to disregard copyright only if you have enough billions of dollars on hand is the worst outcome. It's literally laws that only apply to the poor.

Be careful where you're going here. If you maximally/strictly interpret copyright law, the Internet Archive (including Wayback Machine) is largely violating copyright all the time. (WAY beyond the ongoing dispute with the publishers over the lending library.) Most web content is non-permissively licensed.
I don't believe Internet Archive should be permitted to disregard copyright wholly either.
Or because the results of these models are so transformative that you could pass it off as fair use.
If that's the standard, then it is worth noting that we are talking about companies that are trying to do something that literally (as far as can be proven today) can't be done (build an AGI).

Contacting millions of people is something many businesses on earth do.

If these companies are already engaged in trying to do something that quite literally can't be done (again, as far as can be proven today), it's not out of line to ask them to at least try to do something that many other companies actually do in practice (pay lots of people).

It's important to be very clear that this is something that could be done, but that the AI companies do not want to even try to do.

It's not a problem for the music and video streamers. Get real. They could even have an AI do it for them!
Imagine if the US (public and private sector) transferred 2% of GDP to Warner Bros.
Then you can't use it. Or maybe it's time to abolish copyright. Turnabout is fair play: if copyright binds me, it also binds you. If it doesn't bind you, it doesn't bind me. Anything else is pure corruption.