Hacker News new | ask | show | jobs
by gsjbjt 1645 days ago
Nice post! I work on NLP and I think a lot of ideas in this post resonate with what I find exciting about working on the intersection of language + the real world: large text datasets as sources of abundant prior knowledge about the world, structure of language ~ structure of concepts that matter to humans, etc.

I feel like the bottleneck is getting access to paired (language, other modality) data though (if your other modality isn't images). i.e. "bolt on generalization" is an intuitively appealing concept, but then it reduces to the hard problem of "how do I learn to ground language to e.g. my robot action space?" I haven't seen a robotics + language paper that actually grapples with the grounding problem / tries to think about how to scale the data collection process for language-conditioned robotics beyond annotating your own dataset as a proof-of-concept. Unlike language modeling / CLIP-type pretraining, it seems (fundamentally?) more difficult to find natural sources of supervision of (language, action). I'd be curious about your thoughts on this!

> When it comes to combining natural language with robots, the obvious take is to use it as an input-output modality for human-robot interaction. The robot would understand human language inputs and potentially converse with the human. But if you accept that “generalization is language”, then language models have a far bigger role to play than just being the “UX layer for robots”.

You should check out Jacob Andreas's work, if you haven't seen it already - esp. his stuff on learning from latent language (https://arxiv.org/abs/1711.00482).

1 comments

My hope is that sufficiently rich language models obviate the need for a lot of robot-language grounding data.

LfP (https://learning-from-play.github.io/) was a work that inspired me a lot. They relabel a few hours of open-ended demonstrations (humans instructed to play with anything in the environment) with a lot of hindsight language descriptions, and show some degree of general capability acquired through this richer language. You can describe the same action with a lot of different descriptions, e.g. "pick up the leftmost object unless it is a cup" could also be relabeled as "pick up an apple".

That being said, the LfP paper stops short of testing whether we can improve robotics solely by only scaling language - a confounding factor and central to their narrative was the role of "open-ended play data". We do need some paired data to ground (language, robot-specific sensor/actuator modalities), but perhaps we can scale everything else with language only data.

Thanks to the pointer on the Andreas paper! This is indeed quite relevant to the spirit of what I'm arguing for, though I prefer the implementation realized by the Lu et al '21 paper.

> We do need some paired data

A couple of under-explored rich sources of training data on actions are videos and code. Videos, showing how people interact with objects in the world to achieve goals, might also come with captions and metadata, while code comes with comments, messages and variable names that relate to real world concepts, including millions of tables and business logic.

Maybe in the future we will add rich brain scans as an alternative to text. That kind of annotation would be so easy to collect in large quantities, provided we can wear neural sensors. If it's impractical to scan the brain, we can wear sensors and video cameras and use eye tracking and body tracking to train the system.

I am optimistic that language modelling can become the core engine of AI agents, but we need a system that has both a generator and a critic, going back and forth for a few rounds, doing multi-step problem solving. Another must is to allow search engine queries in order to make more efficient and correct models, not all knowledge must be burned into the weights.

> My hope is that sufficiently rich language models obviate the need for a lot of robot-language grounding data.

I feel like this is “missing the trees for the forest.” In my experience, generality only emerges after a critical mass of detailed low-level examples is collected and arranged into a pattern. Humans can’t actually reason about purely abstract ideas very well. Experts always have specifics in mind they are working from.

So I'm not convinced leaving it to the model gets you anything new.

I feel that the (IMHO plausible) idea is that a sufficiently rich language model can enable transfer learning for robotics, where you can effectively replace a lot of robot-language grounding data with a small amount of robot-language grounding and a lot of pure language data.