| HN Mirror


by Nevermark 301 days ago

Another way to do reinforcement learning is to first train a model to judge the quality of its own answers, matching judgements from experts or synthetically created ones, until it can judge its answer quality even if it can’t yet use that information to improve its responses.

It can be easier to recognize good responses than generate them.

Then feed it queries and have it generate both responses and judgements. Instead of training the responses to match response data, train it to output a high positive judgement while holding its “judgement” weights constant. To raise its judgement values, the model now has to give better answers: the gradient backpropagated through the frozen judgement weights carries information from the judgement back to how the responses should change to improve.

Learn to predict/judge what is good or bad. Then learn to maximize good and minimize bad using the judgment/prediction as a proxy for actual feedback.
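A toy sketch of those two phases, under loud assumptions: the "expert" here is a made-up scoring function (the ideal response to query q is 2*q), the judge is a tiny linear model on quadratic features, and the generator is a single weight. None of these names or choices come from the comment above; it only illustrates training a judge, freezing it, then pushing gradients through it into the generator.

```python
import random

def expert_score(q, r):
    # Hypothetical expert: the ideal response to query q is 2*q.
    return -(r - 2.0 * q) ** 2

def judge_score(c, q, r):
    # Judge: linear model on quadratic features of (query, response).
    return c[0] * r * r + c[1] * r * q + c[2] * q * q

def train_judge(rng, steps=5000, lr=0.3):
    # Phase 1: fit the judge to expert scores on random (query, response) pairs.
    c = [0.0, 0.0, 0.0]
    for _ in range(steps):
        q, r = rng.uniform(-1, 1), rng.uniform(-1, 1)
        err = judge_score(c, q, r) - expert_score(q, r)
        feats = (r * r, r * q, q * q)
        # SGD step on 0.5 * err**2 with respect to the judge weights.
        c = [ci - lr * err * f for ci, f in zip(c, feats)]
    return c

def train_generator(c, rng, steps=500, lr=0.5):
    # Phase 2: judge weights c stay frozen; only the generator weight w moves.
    w = 0.0  # generator: response = w * query
    for _ in range(steps):
        q = rng.uniform(-1, 1)
        r = w * q
        # Gradient of the judge's score with respect to the response.
        dscore_dr = 2.0 * c[0] * r + c[1] * q
        # Chain rule (dr/dw = q), then gradient ascent on the judged score.
        w += lr * dscore_dr * q
    return w

if __name__ == "__main__":
    rng = random.Random(0)
    judge = train_judge(rng)
    w = train_generator(judge, rng)
    print("learned generator weight:", w)
```

The generator never sees the expert score or any target responses in phase 2; the only training signal is the frozen judge's opinion of its own output, so any bias the judge picked up in phase 1 is passed straight along.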

This technique is closer to traditional human/animal reinforcement learning.

We learn to predict situations that will cause us pain or positive affect, then learn to choose actions that minimize our predictions of bad and maximize our predictions of good. That is a much more efficient way to learn than having to actually experience everything and always get explicit external feedback.

There are many, many ways to do reinforcement learning.

1 comments

varispeed 301 days ago

The snag is: 'experts' aren’t neutral oracles. Many are underpaid and end up parroting whoever funds them. Lobby groups quietly buy authority all the time. So the real challenge isn’t just training on expert judgments, it’s making the model sharp enough to spot the BS in those judgments - otherwise you’re just encoding the bias straight into the weights.


htfu 301 days ago

Which is why the foundation players must soon take on the additional role of being an ad buyer.

Interactive stuff, within content. A mini game in a game, school homework of course, or "whichever text box the viewer looks at longest by WorldCoin Eyeball Tracker for Democracy x Samsung" for an interstitial turned captcha.

Better hope your taste isn't too bland and derivative!

Amazon and Ali soon lap the field by allowing coupon farming, but somehow eventually end up where they started.


Nevermark 300 days ago

> The snag is: 'experts' aren’t neutral oracles.

Without knowing who/what the experts are, how they are used, what they are judging, what structure and mitigations are in place around their use, and what degree of neutrality is required - with all other factors and techniques being used - you can't make any such claim.

It's so easy to dismiss something.

A general algorithm isn't a claim that its practical use won't require accommodating the specific complications of each context.

Very much like how data scientists don't expect their best algorithms to operate well without also resolving a stream of practical issues, in standard and ad hoc ways, as needed.
