by xatalytic 1157 days ago
15,000 instruction-tuning records generated by Databricks employees, covering seven of the behavior categories outlined in the InstructGPT paper (the predecessor to ChatGPT). Coincides with the release of Dolly 2.0, which is trained exclusively on this dataset and demonstrates high quality (but not state-of-the-art) instruction-following behavior.

The data and models are licensed for commercial use, setting them apart from recent releases trained on data from OpenAI.

1 comments

>Coincides with the release of Dolly 2.0, which is trained exclusively on this dataset and demonstrates high quality (but not state-of-the-art) instruction-following behavior.

This is not correct. It was fine-tuned on this dataset, but the base model itself is EleutherAI's 12B Pythia model.

There are two: a 6B-parameter model fine-tuned from GPT-J and a 12B-parameter model fine-tuned from Pythia.
The GPT-J-6B one is Dolly 1.0, which was released previously.

Dolly 2.0 is Pythia-12B fine-tuned on this new dataset.

On their Hugging Face page [1] they admit the performance may not be much better, or any better, than the original model. (I am guessing this may be a weakness of Pythia-12B, which was intended for model-training research rather than best results.)

The main point of Dolly 2.0 is that the new dataset is legally unencumbered [2], whereas Alpaca et al. were trained on ChatGPT transcripts, so commercialising those models would contravene OpenAI's licensing terms.

[1] https://huggingface.co/databricks/dolly-v2-12b

[2] https://www.databricks.com/blog/2023/04/12/dolly-first-open-...

I think there's probably nothing wrong with training on others' ChatGPT transcripts posted on the open web. OpenAI trains on source-available projects with non-commercial terms, so their lawyers have already been over a similar case and decided it should be fine.
Not just that: Imagine OpenAI going to court and establishing the legal precedent that makes their own product illegal.

So OpenAI can claim whatever they like, but there is no way they will ever pursue legal action, unless their intent is to lose the court case and thereby establish the precedent that it is okay to train on random data scraped from the internet.

We would also end up in a weird situation anyhow, where it is hard or impossible to prove whether all, some, or none of the information in a dataset was curated by humans. So in the worst case, companies will work with human curators (but secretly supplement with gray-sourced materials) during training. Just like how it's hard to get 100% slave-free coffee beans or cacao.

I don't think it's about things being illegal per se.

It's that they can sue you because, by making a competing product with data obtained by using their product, you contravened their terms & conditions for using that product.