by markisus
251 days ago
Can someone explain the bit-counting argument in the reinforcement learning part? I don’t get why a trajectory would provide only one bit of information. Each step of the trajectory at least gives information about which state transitions are possible. An infinitely long trajectory can explore the whole state space if there are no absorbing states. Such a trajectory would provide a massive amount of information about the system, even if we ignored the final reward.
This is in contrast to more "supervised" forms of learning, where you get a loss for each token produced (e.g. cross-entropy loss) and, as a consequence, O(number of tokens) of information flowing into your gradients.
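To make the counting concrete, here is a rough back-of-the-envelope sketch. It assumes the reward is a single pass/fail outcome per trajectory and that each supervised target token is drawn from a vocabulary of some fixed size; the function names and the numbers in the usage note are made up for illustration. It only bounds the supervision signal itself, not whatever a learner could in principle infer from the observed transitions, which is the point raised in the question.

```python
import math


def binary_entropy_bits(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) outcome, e.g. a pass/fail reward."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def outcome_reward_bits(p_success: float) -> float:
    """Bits carried by one binary end-of-trajectory reward (assumed setup).

    Note that the trajectory length does not appear here at all.
    """
    return binary_entropy_bits(p_success)


def per_token_label_bits(num_tokens: int, vocab_size: int) -> float:
    """Upper bound on bits carried by per-token targets.

    Each target token is one of vocab_size symbols, so it carries at most
    log2(vocab_size) bits; the bound grows linearly with num_tokens.
    """
    return num_tokens * math.log2(vocab_size)
```

For example, `outcome_reward_bits(0.5)` is exactly 1 bit however long the trajectory is, while `per_token_label_bits(1000, 2)` is already 1000 bits for a binary vocabulary. The first stays at most 1 and the second scales with the number of tokens, which is the O(1) versus O(number of tokens) contrast.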