Hacker News
by pointlessone 906 days ago
I think your question is incorrect. It’s very likely that no one thinks it’s perfectly legal. There are probably many people who think it’s not a big deal, though. Try coming up with a dataset that doesn’t have any copyrighted material in it. Like, seriously, try. You can’t use pretty much anything newer than a century old. Everything is copyrighted by default. Very few new things are explicitly in the public domain or licensed in a way that would allow such usage. Now imagine LLMs trained on early 20th century newspapers, books and letters. Do you think they would be good at generating code or hip copy for the homepage of your next startup?
2 comments

> Now imagine LLMs trained on early 20th century newspapers, books and letters. Do you think they would be good at generating code or hip copy for the homepage of your next startup?

Not sure about the rest of the world, but at least for US content I don't think any company would publish that LLM.

That's like 40 years before the civil rights movement, and right about the time of the Tulsa massacre.

It's right around when women got the right to vote.

Trying to get it to not say anything horrible under modern standards seems fraught with issues. I don't know if it would even understand something like "don't be racist", given the context it was trained on.

Exactly. Copyright terms are so long that most material with expired copyright is not useful for modern uses of LLMs, and finding modern non-copyrighted material is too hard to do quickly, with unclear usefulness once you have it. So people who grew up with the Internet and are used to making memes with copyrighted material are not exactly averse to doing it on a bigger scale.
> Try coming up with a dataset that doesn’t have any copyrighted material in it.

Isn't this what Mistral AI did?

Did they? That'd be interesting to take a look at. Do they publish the contents of their dataset?
The raw weights are here: https://docs.mistral.ai/models/