Hacker News
by donfuzius 1161 days ago
It's awesome that the OpenAssistant project made it this far with a lot of crowd-sourced input. Congrats to the whole team, who are working really hard to create a truly open LLM.

One thing that puzzles me, though, is that for the GPT-3.5 comparison, the model used is trained on both OpenAssistant and Alpaca data, the latter of which is not free due to the OpenAI license used to generate the data. Isn't that defeating the purpose?

"... Completions were generated using pythia-12b-deduped fine-tuned on the OpenAssistant and Alpaca [9] dataset as well as gpt-3.5-turbo using the OpenAI API..."

3 comments

> due to the OpenAI license used to generate the data.

What makes you think OpenAI responses are copyrighted in any way?

If OpenAI owns OpenAssistant because it was trained in part on ChatGPT outputs, then Andrew Hussie owns ChatGPT because it was trained in part on Homestuck.
Copyright of AI output is not proven.
This is not about copyright but about the OpenAI terms of use that you agree to when you use ChatGPT or the API, which forbid using the output to build «competing models».
I rather think it's the opposite: it's almost definitely proven that it is not, since it is obviously completely transformative.