Hacker News
by smuser 931 days ago
My understanding (not an expert) is that a lot of problem domains have very sparse, infrequent rewards. Imagine if the only reward you gave a Minecraft agent was for mining a diamond: it would take a lot of gameplay before it randomly did that and got a reward. So researchers spend time tuning the reward space (oh, you mined some dirt, here's a tiny reward; oh, you mined rock, a greater reward; etc.), but it's kind of akin to the hand-crafted feature detection of the pre-neural-network days. The Q* mystery is whether OpenAI 'solved' reward modelling the same way neural networks solved feature detection.
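To make the sparse vs. hand-tuned distinction concrete, here's a rough sketch. The reward values and block names are made up for illustration and don't come from any real Minecraft agent setup.

```python
# Hypothetical reward tables; the numbers are illustrative assumptions only.
SPARSE_REWARD = {"diamond": 1.0}  # reward only for the final goal
SHAPED_REWARD = {"dirt": 0.01, "rock": 0.05, "iron": 0.2, "diamond": 1.0}  # hand-tuned


def reward(block, table):
    """Reward for mining one block; anything not in the table earns nothing."""
    return table.get(block, 0.0)


def episode_return(blocks, table):
    """Total reward collected over one episode of mined blocks."""
    return sum(reward(b, table) for b in blocks)


# An early, mostly random episode that never reaches a diamond.
episode = ["dirt", "dirt", "rock", "dirt", "rock"]

sparse_total = episode_return(episode, SPARSE_REWARD)  # no learning signal at all
shaped_total = episode_return(episode, SHAPED_REWARD)  # small but nonzero signal

print(sparse_total, shaped_total)
```

With the sparse table the agent gets zero feedback from this episode, while the shaped table gives it something to climb. The catch is that someone had to pick those intermediate numbers by hand.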
2 comments

Sounds like the process of tuning the reward space is a type of labelling and ranking problem. If I'm not mistaken, those are two things that GPT-4 is pretty good at. You wouldn't even necessarily need to pre-label every possible action, since GPT-4 could do it in real time.
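A minimal sketch of what I mean, assuming you have some scorer that can rank candidate actions on demand. Here `toy_scorer` is a hard-coded stand-in for a model call, and `rank_rewards` is a name I made up; nothing here is an actual GPT-4 API.

```python
from typing import Callable, Dict, List


def rank_rewards(actions: List[str], score: Callable[[str], float]) -> Dict[str, float]:
    """Turn a scorer's ranking of candidate actions into rewards in [0, 1]."""
    ordered = sorted(set(actions), key=score)  # worst to best
    if len(ordered) == 1:
        return {ordered[0]: 1.0}
    # Spread rewards evenly by rank: worst gets 0.0, best gets 1.0.
    return {a: i / (len(ordered) - 1) for i, a in enumerate(ordered)}


def toy_scorer(action: str) -> float:
    """Hypothetical stand-in: a real setup would query a model here."""
    return {"mine dirt": 1.0, "mine rock": 2.0, "mine diamond": 10.0}.get(action, 0.0)


# Label only the actions seen right now, instead of pre-labelling everything.
rewards = rank_rewards(["mine rock", "mine diamond", "mine dirt"], toy_scorer)
print(rewards)
```

The point is that the reward table is computed on the fly from a ranking rather than written out in advance. Whether a model's rankings are good enough to train on is a separate question.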
So the reward is for a successful prediction against a goal, and the nuance is in defining the goal?