| HN Mirror



	by fzliu 888 days ago
	In my mind, what's more crucial here is the code for downloading/scraping and labeling the data, not the model architecture or the training script. As much as I appreciate Mis(x)tral, I would've loved it even more if they had released the code for gathering data.

3 comments

declaredapple 888 days ago

I'm speculating that they are trying to avoid controversy about their data sources. That, and a possible competitive edge depending on which specific sets/filtering they're using.


ssgodderidge 888 days ago

To avoid controversy AND potential lawsuits.


declaredapple 888 days ago

Yup.

I think many countries will allow IP-protected material to be used as training data (Japan already has).

They just need to buy time until then.


wruza 888 days ago

It’s common for third-party model testers not to disclose what they mean by the “Refusal” parameter as well, for obvious reasons. The world is full of witch-hunting maniacs now and will stay that way for an indefinite amount of time. Just wait until the whole thing becomes more widely known and they realize. All AI companies have to hurry up before the doors shut.


PeterisP 888 days ago

IMHO much of the key training data can't simply be downloaded/scraped/labeled, no matter what code you have. It's not as if it's freely accessible to everyone and just needs some code to fetch and process it. You can't scrape all of the Google Books archive or all of Twitter, and quite a few sites that could be scraped at one point may actively prevent you from scraping them now.


pk-protect-ai 888 days ago

I wouldn't mind having ready-to-use datasets instead of the code for downloading/scraping and labeling. It would save a lot of time. Writing code to gather the data is not complicated, but it may sometimes be impossible to replicate the datasets if some of the data you have to scrape is already gone (removed for various reasons).
