


	by smuser 978 days ago
	My understanding (not an expert) is that a lot of problem domains have very sparse, infrequent rewards. Imagine if the only reward you gave a Minecraft agent was when it mined a diamond: it would take a lot of gameplay for it to randomly do that and get a reward. So researchers spend time tuning the reward space (oh, you mined some dirt, here's a tiny reward; oh, you mined rock, a greater reward; etc.), but that's kind of akin to hand-crafted feature detection from the pre-neural-network days. The Q* mystery is whether OpenAI 'solved' reward modelling the same way neural networks solved feature detection.
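
	To make the sparse vs. hand-tuned distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the block names, the reward values, and the helper names (`sparse_reward`, `shaped_reward`, `rewarded_steps`) are hypothetical and not taken from any real Minecraft agent or RL library.

```python
# Hypothetical, hand-picked reward values (illustration only).
SHAPED_REWARDS = {"dirt": 0.01, "rock": 0.1, "diamond": 1.0}


def sparse_reward(block: str) -> float:
    """Reward only the final goal: mining a diamond."""
    return 1.0 if block == "diamond" else 0.0


def shaped_reward(block: str) -> float:
    """Hand-tuned intermediate rewards for progress toward the goal."""
    return SHAPED_REWARDS.get(block, 0.0)


def rewarded_steps(episode, reward_fn) -> int:
    """Count the steps in an episode that give the agent any nonzero signal."""
    return sum(1 for block in episode if reward_fn(block) > 0)


# A made-up episode in which the agent never finds a diamond.
episode = ["dirt", "dirt", "rock", "dirt", "rock", "air"]

print("sparse signal steps:", rewarded_steps(episode, sparse_reward))
print("shaped signal steps:", rewarded_steps(episode, shaped_reward))
```

	With the sparse function this episode gives the agent nothing to learn from, while the shaped one gives feedback on most steps. The catch is that someone had to pick the numbers in `SHAPED_REWARDS` by hand, which is the part being compared to hand-crafted features.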

2 comments

throwaway4aday 978 days ago

Sounds like the process of tuning the reward space is a type of labelling and ranking problem. If I'm not mistaken, those are two things that GPT-4 is pretty good at. You wouldn't necessarily even need to pre-label every possible action, since GPT-4 could do it in real time.


pyinstallwoes 978 days ago

Reward for a successful prediction against a goal, then the nuance is defining a goal?
