


	by rg111 987 days ago
	This is bad for the small guys, though. Do you think that DeepMind, OpenAI, etc. don't have this dataset copied 10 times over? Think again!

3 comments

quonn 987 days ago

How is this different from running pirated software for example? Big business doesn't do it. And if they do, they get fined heavily. The same should be true for illegal AI training data. Surely this is a problem that can be solved.

michaelt 987 days ago

> How is this different from running pirated software for example? Big business doesn't do it.

All the current LLMs are trained on data scraped from random websites, with no regard for the website's copyright, aren't they?

Presumably because these big businesses have a theory that training an LLM is 'fair use'.

Why wouldn't they treat books the same way they treat web pages?

__loam 987 days ago

I'm tired of tech companies abusing the public then asking for forgiveness.

MPSimmons 987 days ago

I actually think this kind of use is purer to the original intent of the web, where everything on the internet was freely available and consumption was encouraged. "Information wants to be free" used to be the rallying cry.

That being said, it feels like there's also a shade of perspective from the old quote:

"In its majestic equality, the law forbids rich and poor alike to sleep under bridges, beg in the streets and steal loaves of bread." - assuming everything in public is fair game, then everyone is welcome to build a multi-petabyte database of text and use millions of dollars worth of GPUs to train an AI on it.

__loam 987 days ago

That last point is great. I've definitely seen a lot of people talking about how we need to let the little guy develop their own AI, with very little attention paid to the realistic costs of doing so. I believe GPT-4 cost around $100 million to train.

oooyay 987 days ago

Bold of you to think they're asking for forgiveness at all. I've seen mostly apologists.

disposition2 987 days ago

I agree and feel the same way.

But it also feels like standing at the edge of the sea complaining about the tide coming in...I'm not sure there's really much that can / will be done about it.

rpd9803 987 days ago

We've tried nothing and are all out of ideas!

simbolit 987 days ago

Except big business does it all the time. Your parent comment mentioned OpenAI, who does it.

justapassenger 987 days ago

It's a valid question: how can smaller companies be competitive in this business?

But saying "we steal to stay relevant, because we don't have as much funding as our competitors" is not the answer.

delecti 987 days ago

Your argument seems to be: why stop the little guys from doing something illegal if the big guys are doing it too? We should ideally stop them all, but it isn't surprising that the blatant examples of illegality are stopped first.

artninja1988 987 days ago

If datasets are not shared externally, which is how DeepMind etc. do it, it is much more of a grey area. Training may well end up being fair use. So yes, the efforts of Shawn compiling Books3 just level the playing field a bit.
