Absolutely. Maximizing conditional probabilities is easily modeled as a Markov decision process, which is why RL works so well for training Transformers (hence RLHF). I've also been experimenting with RL-based training for Transformers in other applications, and it's promising! Using a Transformer as the model for RL, choosing tokens to maximize overall sequence likelihood when all you have is an estimate of the immediate conditional likelihood, is something I imagine many people have experimented with, but I can see it being tricky enough that OpenAI are the only ones to have pulled it off.
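To make the MDP framing concrete, here's a minimal sketch. Everything in it is an assumption for illustration: the two-token vocabulary, the made-up `COND` table standing in for a model's conditional probabilities, and the helper names. The state is the token prefix, the action is the next token, and the reward is the log conditional probability, so the return of an episode is the sequence log-likelihood. Note this uses brute-force search rather than RL. It only shows why picking the locally most likely token is not the same as maximizing overall likelihood.

```python
import math
from itertools import product

# Hypothetical toy model p(next | previous token); the numbers are made up.
COND = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.55, "b": 0.45},
    "b": {"a": 0.9, "b": 0.1},
}
VOCAB = ("a", "b")


def step(state, action):
    """MDP transition: state is the token prefix, action is the next token.

    The reward is the log conditional probability of the chosen token.
    """
    reward = math.log(COND[state[-1]][action])
    return state + (action,), reward


def rollout(policy, horizon):
    """Run a policy for `horizon` steps; return (tokens, summed log-prob)."""
    state, total = ("<s>",), 0.0
    for _ in range(horizon):
        state, reward = step(state, policy(state))
        total += reward
    return state[1:], total


def greedy_policy(state):
    """Myopic policy: take the token with the highest immediate probability."""
    return max(VOCAB, key=lambda tok: COND[state[-1]][tok])


def best_sequence(horizon):
    """Exhaustive search over all token sequences (fine only for toy sizes)."""
    best = None
    for seq in product(VOCAB, repeat=horizon):
        # Fixed plan: at a prefix of length n (including <s>), emit seq[n - 1].
        tokens, total = rollout(lambda s, seq=seq: seq[len(s) - 1], horizon)
        if best is None or total > best[1]:
            best = (tokens, total)
    return best


if __name__ == "__main__":
    for name, (tokens, total) in [
        ("greedy", rollout(greedy_policy, 2)),
        ("best", best_sequence(2)),
    ]:
        print(name, tokens, round(math.exp(total), 4))
```

With these numbers the greedy policy picks "a" then "a" (probability about 0.33), while the best two-token sequence is "b" then "a" (0.36). Closing that gap without enumerating every sequence is where a learned policy would come in.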