by gsjbjt
1645 days ago
Nice post! I work on NLP and I think a lot of ideas in this post resonate with what I find exciting about working on the intersection of language + the real world: large text datasets as sources of abundant prior knowledge about the world, structure of language ~ structure of concepts that matter to humans, etc.
I feel like the bottleneck is getting access to paired (language, other modality) data though (if your other modality isn't images). That is, "bolt on generalization" is an intuitively appealing concept, but then it reduces to the hard problem of "how do I learn to ground language to e.g. my robot action space?" I haven't seen a robotics + language paper that actually grapples with the grounding problem / tries to think about how to scale the data collection process for language-conditioned robotics beyond annotating your own dataset as a proof-of-concept. Compared to language modeling / CLIP-type pretraining, it seems (fundamentally?) more difficult to find natural sources of (language, action) supervision. I'd be curious about your thoughts on this!
> When it comes to combining natural language with robots, the obvious take is to use it as an input-output modality for human-robot interaction. The robot would understand human language inputs and potentially converse with the human. But if you accept that “generalization is language”, then language models have a far bigger role to play than just being the “UX layer for robots”.
You should check out Jacob Andreas's work, if you haven't seen it already - esp. his stuff on learning from latent language (https://arxiv.org/abs/1711.00482).
|
LfP (https://learning-from-play.github.io/) was a work that inspired me a lot. They relabel a few hours of open-ended demonstrations (humans instructed to play with anything in the environment) with a lot of hindsight language descriptions, and show some degree of general capability acquired through this richer language. You can describe the same action with a lot of different descriptions, e.g. "pick up the leftmost object unless it is a cup" could also be relabeled as "pick up an apple".
That being said, the LfP paper stops short of testing whether we can improve robotics solely by scaling language: a confounding factor, and one central to their narrative, was the role of "open-ended play data". We do need some paired data to ground (language, robot-specific sensor/actuator modalities), but perhaps we can scale everything else with language-only data.
Thanks for the pointer to the Andreas paper! This is indeed quite relevant to the spirit of what I'm arguing for, though I prefer the implementation realized by the Lu et al. '21 paper.
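To make the hindsight relabeling idea above concrete, here is a rough sketch of the general pattern, not the LfP authors' actual pipeline. Everything in it is an assumption for illustration: the `hindsight_relabel` function, the `LabeledWindow` type, the `annotate` callback (standing in for a human annotator), and the use of plain strings in place of real observations and actions.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in: a step is (observation, action); strings replace real sensor/actuator data.
Step = Tuple[str, str]


@dataclass
class LabeledWindow:
    instruction: str   # language description written after the fact
    steps: List[Step]  # the play segment the description refers to


def hindsight_relabel(
    play_log: Sequence[Step],
    annotate: Callable[[List[Step]], List[str]],
    window: int,
    num_windows: int,
    seed: int = 0,
) -> List[LabeledWindow]:
    """Sample short windows from unlabeled play and pair each with hindsight descriptions."""
    if window <= 0 or window > len(play_log):
        raise ValueError("window must be between 1 and len(play_log)")
    rng = random.Random(seed)
    pairs: List[LabeledWindow] = []
    for _ in range(num_windows):
        start = rng.randrange(0, len(play_log) - window + 1)
        segment = list(play_log[start:start + window])
        # One segment can yield several descriptions, so a little play data
        # produces many (language, behavior) training pairs.
        for instruction in annotate(segment):
            pairs.append(LabeledWindow(instruction, segment))
    return pairs


if __name__ == "__main__":
    log = [(f"obs{i}", f"act{i}") for i in range(10)]

    def toy_annotator(segment: List[Step]) -> List[str]:
        # A human would write these after watching the segment.
        return ["pick up the leftmost object unless it is a cup", "pick up an apple"]

    for pair in hindsight_relabel(log, toy_annotator, window=3, num_windows=2):
        print(pair.instruction, "->", [action for _, action in pair.steps])
```

The point of the sketch is only the data shape: the same segment shows up under multiple instructions, which is where the "richer language" comes from without collecting more robot data.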