by vineyardmike 693 days ago
> this also means that because we've exhausted the human generated content by now as means of training LLMs, new models will start getting trained with mostly the output of other LLMs

There is also a rapidly growing industry of people whose job it is to write content to train LMs against. I totally expect this to be a growing source of training data at the frontier instead of more generic crap from the internet.

Smaller models will probably continue to be trained on the output of bigger models, however.

2 comments

If we truly owned our own data, we could all have passive income.
> growing industry of people whose job it is to write content to train LMs against

Do you have an example of this?

How do they differentiate content written by a person vs. content written by an LLM? I'd expect there are going to be people trying to "cheat" by using LLMs to generate content.

> How do they differentiate content written by a person vs. content written by an LLM?

Honestly, I'm not sure how to test it, but these are B2B contracts, so hopefully there's some quality control. It's part of the broader "training data labeling" business, so presumably the industry has some terms for this in its contracts.

Scale AI and Appen are big providers that have worked with OpenAI, Google, etc.

https://openai.com/index/openai-partners-with-scale-to-provi...