| I don't understand why it's even a question that Meta trained their LLM on copyrighted material. They say so in their paper! Quoting from their LLaMMa paper [Touvron et al., 2023]: > We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models. Following that reference: > Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020). (Presser, 2020) refers to https://twitter.com/theshawwn/status/1320282149329784833. (Which funnily refers to this DMCA policy: https://the-eye.eu/dmca.mp4) Furthermore, they state they trained on GitHub, web pages, and ArXiv, which are all contain copyrighted content. Surely the question is: is it legal to train and/or use and/or distribute an AI model (or its weights, or its outputs) that is trained using copyrighted material. That it was trained on copyrighted material is certain. [Touvron et al., 2023] https://arxiv.org/pdf/2302.13971 [Gao et al., 2020] https://arxiv.org/pdf/2101.00027 |