Your hard mode is exactly the situation in which RL is used, because it requires neither a corpus of correct examples nor insight into the structure of a good policy.

> How can we confidently extract any signal from a failure to solve a problem if we don't even know if the problem is solvable?

You rule out all the stuff that doesn’t work.

Yes, this is difficult and usually very costly. Credit assignment is a deep problem. But if you didn’t find yourself in a hard-mode situation, you wouldn’t be using RL.
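
To make "rule out all the stuff that doesn't work" concrete, here is a minimal sketch on a toy problem I made up for illustration: a 10-armed bandit where exactly one action earns reward 1 and the rest earn 0. It uses a REINFORCE-style update with a constant baseline, so a failed attempt has negative advantage and pushes probability away from the action that was tried. The names (`train`, `good_action`, `baseline`) and all the numbers are my own assumptions, and it says nothing about credit assignment over long horizons, which is the actually hard part.

```python
import math
import random


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(num_actions=10, good_action=7, steps=2000, lr=0.5,
          baseline=0.5, seed=0):
    """Toy one-step RL: no correct examples, only a 0/1 reward signal.

    Hypothetical setup: one action succeeds, every other action fails.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_actions  # start with a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(num_actions), weights=probs)[0]
        reward = 1.0 if action == good_action else 0.0
        # With baseline 0.5, a failure has advantage -0.5, so it carries
        # signal: the tried action becomes less likely.
        advantage = reward - baseline
        for a in range(num_actions):
            # Gradient of log pi(action) w.r.t. logit a for a softmax policy.
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    final_probs = train()
    best = max(range(len(final_probs)), key=final_probs.__getitem__)
    print("most likely action:", best)
```

Note that with `baseline=0.0` a failure has zero advantage and produces no update at all, so in this sketch the baseline is what lets failed attempts count as information.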