Hacker News
by nickpsecurity 562 days ago
I’ve been wanting to run into someone on the Databricks team. Can you ask whoever trains models like MPT to consider training an open model only on data clear of copyright claims? Specifically, one using only Gutenberg and the permissive code in The Stack? Or just Gutenberg?

Since I follow Christ, I can’t break the law or use what might be produced directly from infringement. I might be able to do more experiments if a free, legal model is available. Also, we can legally copy datasets like PG19 since they’re public domain, whereas most others contain works that I might need a license to distribute.

Please forward the request to the model trainers. Even a 7B model would let us do a lot of research on optimization algorithms, fine-tuning, etc.

1 comment

I think you're looking for OLMo, https://allenai.org/olmo

They appear to use Common Crawl in the DCLM dataset. Just downloading Common Crawl is probably copyright infringement, even before we consider specific terms in the licenses. arXiv papers have a mix of licenses, with some not allowing commercial use.

If I got the sources right, it’s already illegal with just the two sources they scraped. That’s why I want one trained on Gutenberg content that has no restrictions.