| HN Mirror

	by ghaff 545 days ago
	I'm pretty sure any competent lawyer would stipulate that, in many/most cases, training is happening on copyrighted information. I'm also pretty sure that OpenAI is not arguing that all of its training data is either licensed or under copyrights it owns. (Some companies, perhaps Adobe?, have been more conservative.) Perhaps I'm wrong. But I haven't heard that argument publicly and I would need to be convinced.

1 comments

HeatrayEnjoyer 545 days ago

Discovering certain types of data were gathered and used would be much worse.

Training on CNN and Netflix content = i sleep

Training on private personal and corporate inboxes, medical records, and illegal content, purchased from blackhat data brokers = real shit

A Kenyan data labeler famously cut ties with OpenAI after OpenAI asked them to gather CSAM content.

BadHumans 545 days ago

Citation on that?

upghost 545 days ago

https://www.wsj.com/articles/chatgpt-openai-content-abusive-...

https://www.bigdatawire.com/2023/01/20/openai-outsourced-dat...

https://www.theguardian.com/technology/2023/aug/02/ai-chatbo...

https://www.businessinsider.com/openai-kenyan-contract-worke...

https://www.medianama.com/2023/07/223-kenyan-workers-call-fo...

They were asked to label CSAM, to clarify.

BadHumans 544 days ago

Gathering and labeling are two wildly different things, and the difference changes the entire context. They aren't saying "go find this stuff for us"; they are saying that if people upload it or you find it in the data, then label it as such.

hansvm 544 days ago

It only changes who actually gathered the CSAM they asked this person to label. OpenAI definitely gathered it.
