HN Mirror



	by jmmcd 100 days ago
	> Since these companies can’t improve their AI models without fresh data created by human beings

	Totally wrong. Self-play dates back to Arthur Samuel in the 1950s, and RL with verifiable rewards is a key part of training the most advanced models today.
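To make "verifiable rewards" concrete, here is a minimal sketch, assuming a math-style task whose correct answer is known in advance. The function names and the "last integer in the output is the answer" convention are hypothetical, not any lab's actual setup. The point is that a program, not a human rater, scores each sample:

```python
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[int]:
    # Hypothetical convention: treat the last integer in the completion as the answer.
    matches = re.findall(r"-?\d+", completion)
    return int(matches[-1]) if matches else None

def verifiable_reward(completion: str, ground_truth: int) -> float:
    # Binary reward computed by a checker program; no human judges this sample.
    answer = extract_final_answer(completion)
    return 1.0 if answer == ground_truth else 0.0

print(verifiable_reward("7 * 6 = 42, so the answer is 42", 42))  # 1.0
print(verifiable_reward("I think it's 41", 42))                  # 0.0
```

Humans still have to write the problems and the checker, but not label every model output.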

2 comments

rdedev 100 days ago

Not totally wrong. Self-play works well if your problem can be easily simulated in an RL environment where the model can freely explore different states. RLHF and similar techniques don't fit that mold, since we don't really have a simulation environment for language modelling.

Right now there are companies which hire software devs or data scientists to just solve a bunch of random problems so that they can generate training data for an LLM model. Why would they be in business if self play can work out so well?
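For contrast, here is a toy sketch of the case where self-play does work: a game with a perfect, cheap simulator. The game (take 1 to 3 stones; whoever takes the last stone wins) and every name below are hypothetical choices for illustration. One shared value table plays both sides and learns from negamax backups, with no human data at all:

```python
import random

MOVES = (1, 2, 3)  # hypothetical game: take 1-3 stones; whoever takes the last stone wins

def legal_moves(pile):
    return [m for m in MOVES if m <= pile]

def best_move(V, pile):
    # Negamax: a good move leaves the *opponent* in a low-value position.
    return max(legal_moves(pile), key=lambda m: -V[pile - m])

def train_self_play(max_pile=12, episodes=5000, alpha=0.5, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # V[s] = estimated value of a pile of s stones for the player about to move.
    V = [0.0] * (max_pile + 1)
    V[0] = -1.0  # no stones left: the previous player took the last one, so the mover lost
    for _ in range(episodes):
        pile = rng.randint(1, max_pile)  # random starts so every state gets explored
        while pile > 0:
            # The simulator lets us back up the best reply from this state.
            target = max(-V[pile - m] for m in legal_moves(pile))
            V[pile] += alpha * (target - V[pile])
            # Both "players" share V: the same policy plays both sides.
            if rng.random() < epsilon:
                move = rng.choice(legal_moves(pile))
            else:
                move = best_move(V, pile)
            pile -= move
    return V

V = train_self_play()
for pile in range(1, 9):
    print(pile, round(V[pile], 2), best_move(V, pile))
```

After training, piles that are multiples of 4 should come out as losing for the player to move, which matches the known solution of this game. Nothing comparably cheap and exact exists for judging open-ended text.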


notpachet 99 days ago

> Right now there are companies which hire software devs or data scientists to just solve a bunch of random problems so that they can generate training data for an LLM model.

Sounds like Macrodata Refinement.


vidarh 99 days ago

> Why would they be in business if self play can work out so well?

Because it is still cheaper.


cubefox 100 days ago

Current models don't yet use RLVR with self-play though, at least as far as we know. They use RLVR with large numbers of manually created RL environments.

But they will probably use self-play soon. See https://www.amplifypartners.com/blog-posts/self-play-and-aut...
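A rough sketch of what "manually created RL environments" can mean, assuming each environment is just a hand-written prompt plus a checker. The `Env` class, the tasks, and the stub policy are all hypothetical. The human effort goes into authoring each environment, not into labelling each rollout:

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Env:
    prompt: str                   # hand-written task
    check: Callable[[str], bool]  # hand-written verifier

# Hypothetical hand-authored environments; real ones would be far more elaborate.
ENVS = [
    Env("Reverse the string 'abc'.", lambda out: out.strip() == "cba"),
    Env("What is 17 * 3?", lambda out: out.strip() == "51"),
    Env("Sort [3, 1, 2] and answer as comma-separated values.",
        lambda out: out.replace(" ", "") == "1,2,3"),
]

def rollout(policy: Callable[[str], str], env: Env) -> float:
    # One episode: the policy answers the prompt and the checker scores it.
    return 1.0 if env.check(policy(env.prompt)) else 0.0

def average_reward(policy, envs, n=100, seed=0) -> float:
    # Sample environments at random and average the verifiable reward.
    rng = random.Random(seed)
    return sum(rollout(policy, rng.choice(envs)) for _ in range(n)) / n
```

Self-play would replace the fixed `ENVS` list with tasks the model generates for itself, which is the harder part to get right.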
