by ghaff 545 days ago
I'm pretty sure any competent lawyer would stipulate that, in many/most cases, training is happening on copyrighted information. I'm also pretty sure that OpenAI is not arguing that all their training data is either licensed or material they own the copyrights to. (Some companies, perhaps Adobe?, have been more conservative.) Perhaps I'm wrong. But I haven't heard that argument publicly and I would need to be convinced.

Discovering certain types of data were gathered and used would be much worse.

Training on CNN and Netflix content = i sleep

Training on private personal and corporate inboxes, medical records, and illegal content, purchased from blackhat data brokers = real shit

A Kenyan data labeler famously cut ties with OpenAI after OpenAI asked them to gather CSAM content.

Citation on that?
Gather and label are two wildly different things that change the entire context. They aren't saying "go find this stuff for us"; they are saying "if people upload it or you find it in the data, then label it as such."
It only changes who actually gathered the CSAM they asked this person to label. OpenAI definitely gathered it.