| HN Mirror

by 1vuio0pswjnm7 35 days ago

The idea of "all the public works ever created" is easily contested. Not every work has been "published", let alone scanned, digitised or published to the internet

The marketing for "AI" uses phrases like "the sum of all human knowledge" to refer to what has been used to create "models". The assumed irrelevance of non-published, "private" works is dubious if not absurd

The internet now allows potentially anyone to publish anything, e.g., via personal websites, social media pages, etc. But that doesn't mean everyone partakes. How much of the unfiltered garbage published by those who do has been used to create these "models"?

"AI" companies will not reveal exactly what "works" were used to create the "models"

2 comments

1vuio0pswjnm7 35 days ago

I'm not commenting above on the question of "fair use" or about the tragedy of Aaron Swartz, I'm commenting on the word "all", i.e., the hype

But if I were going to comment on Swartz I would ask first whether the "AI" models are trained on the contents of JSTOR, or the contents of PACER (that are not being shared on the internet for free)

Otherwise, the comparison is difficult to make, IMHO

For example, with respect to any materials from JSTOR, the "stealing" was done by the pirate library contributors, not the "AI" companies. And with respect to PACER, the "stealing" by Swartz was, technically, done from government computers

If readers are into "above the law" conspiracy theories about "AI" companies, check out the bizarre story of the OpenAI employee who was the document custodian witness for the plaintiffs in the NYTimes copyright litigation. Committed suicide before testifying

muwtyhg 35 days ago

> The idea of "all the public works ever created" is easily contested.

Hence the word "public," implying that they are published and accessible.

> The internet now allows potentially anyone to publish anything, e.g., via personal websites, social media pages, etc. But that doesn't mean everyone partakes. How much of the unfiltered garbage published by those who do has been used to create these "models"?

This seems like a nitpick instead of actually responding to the idea that they have stolen massive amounts of other people's work and are using it to enrich themselves. And the stealing is ignored or given a slap-on-the-wrist fine, which is not how it has worked for numerous other people in the past (the example being Aaron Swartz). It's kind of irrelevant whether or not the models train on low-effort text on the internet.