The problem is mostly that writing an efficient RL trainer for this takes a lot of engineering effort, and even then the training itself is expensive to run.