by aspenmayer 366 days ago

Sure, why not? lol

https://www.reddit.com/r/DataHoarder/comments/1entowq/i_made...

https://github.com/shloop/google-book-scraper

That Meta torrented Books3 and other datasets appears to come from the admissions of Meta employees who did the work or oversaw those who did, so it is not really disputed or ambiguous.

https://torrentfreak.com/meta-admits-use-of-pirated-book-dat...

redox99 366 days ago

Books3 was used in Llama1. We don't know if they used it later on.

aspenmayer 366 days ago

My comparison was illustrative and analogous in nature. The copyright cartel is making a fruit of the poisonous tree type of argument. Whatever Meta are doing with LLMs is doing the heavy lifting that parity files used to do back in the Usenet days. I wouldn’t be surprised if BitTorrent or other similar caching and distribution mechanisms incorporate AI/LLMs to recognize an owl on the wire, draw the rest just in time in transit, and just send the diffs, or something like that.

The pictures are the same. All roads lead to Rome, so they say.

aprilthird2021 366 days ago

All of the major AI models these days use "clean" datasets stripped of copyrighted material.

They also use data from previous models, so I'm not sure how "clean" it really is.

dragonwriter 366 days ago

> All of the major AI models these days use "clean" datasets stripped of copyrighted material.

Which of the major commercial models discloses its dataset? Or are you just trusting some unfalsifiable self-serving PR characterization?

aprilthird2021 365 days ago

It's from my personal experience in the industry

aspenmayer 365 days ago

What are your thoughts on the origin of the LLaMA leak? It's interesting that the training data was torrented, and so was the leak. Perhaps we will never know? For the OSINT folks, not a lot to go on, or maybe a lot, depending?

https://en.wikipedia.org/wiki/Llama_(language_model)#Leak

https://archived.moe/g/thread/91848262#p91850335

https://github.com/meta-llama/llama/pull/73/files

aprilthird2021 364 days ago

I don't really know much about that, sorry

pclmulqdq 366 days ago

All written text is copyrighted, with few exceptions like court transcripts. I own the copyright to this inane comment. I sincerely doubt that all copyrighted material is scrubbed.

Tepix 366 days ago

Your brief comment is hardly copyrightable. Which makes your point moot.
