Hacker News
by fzliu 888 days ago
In my mind, what's more crucial here is the code for downloading/scraping and labeling the data, not the model architecture or the training script.

As much as I appreciate Mis(x)tral, I would've loved it even more if they released code for gathering data.

3 comments

I'm speculating they are attempting to avoid controversy about their data sources. That, and a possible competitive edge depending on which specific sets and filtering they're using.
To avoid controversy AND potential lawsuits.
Yup.

I think many countries (Japan already has) will allow IP-protected material to be used as training data.

They just need to buy time until then.

It’s common for third-party model testers not to disclose what they mean by the “Refusal” parameter as well, for obvious reasons. The world is full of witch-hunting maniacs now and will stay that way indefinitely. Just wait until the whole thing becomes more widely known and they realize. All AI companies have to hurry up before the doors shut.
IMHO much of the key training data can't simply be downloaded/scraped/labeled, no matter what code you have. It's not as if it's freely accessible to everyone and just needs some code to fetch and process it. You can't scrape all of the Google Books archive or all of Twitter, and quite a few things that could be scraped at one point may actively prevent you from scraping them now.
I wouldn't mind having ready-to-use datasets instead of the code for downloading/scraping and labeling. It would save a lot of time. Writing some code to gather the data is not complicated, but it may sometimes be impossible to replicate the datasets if parts of the data you have to scrape are already gone (removed for various reasons).