Seems like the space is still very early. Some relevant work:
Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization: https://proceedings.mlr.press/v180/sohn22a/sohn22a.pdf
Learning to Navigate the Web: https://arxiv.org/pdf/1812.09195.pdf
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility: https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136...
MiniWoB++: a web interaction benchmark for reinforcement learning: https://github.com/Farama-Foundation/miniwob-plusplus