Hacker News new | ask | show | jobs
by modeless 790 days ago
These videos are all autonomous. They didn't have that in the 1950s.
1 comments

I can appreciate that, but also they are recording and replaying motor signals from specific teleoperation demonstrations. Something that _was_ possible in the 1950s. You might say that it is challenging to replay demonstrations well on lower-quality hardware. And so there is academic value in trying to make it work on worse hardware, but it would not be my goto solution for real industry problems. E.g. this is not a route I would fund for a startup, for example.
They do not replay recorded motor signals. They use recorded motor signals only to train neural policies, which then run autonomously on the robot and can generalize to new instances of a task (such as the above video generalizing to an adult size sweater when it was only ever trained on child size polo shirts).

Obviously some amount of generalization is required to fold a shirt, as no two shirts will ever be in precisely the same configuration after being dropped on a table by a human. Playback of recorded motor signals could never solve this task.

> recorded motor signals only to train neural policies

Is interesting that they are using "Leader Arms" [0] to encode tasks instead of motion capture. Is it just a matter of reduced complexity to get off the ground? I suppose the task of mapping human arm motion to what a robot can do is tough.

0. https://www.trossenrobotics.com/widowx-aloha-set

I appreciate that going from polo shirts to sweaters is a form of "generalisation" but that's only interesting because of the extremely limited capability for generalisation that systems have when they're trained by imitation learning, as ALOHA.

Note for example that all the shirts in the videos are oriented in the same direction, with the neck facing to the top of the video. Even then, the system can only straighten a shirt that lands with one corner folded under it after many failed attempts, and if you turned a shirt so that the neck faced downwards, it wouldn't be able to straighten it and hang it no matter how many times it tried. Let's not even talk about getting a shirt tangled in the arms themselves (in the videos, a human intervenes to free the shirt and start again). It's trained to straighten a shirt on the table, with the neck facing one way [1].

So the OP is very right. We're no nearer to real-world autonomy than we were in the '50s. The behaviours of the systems you see in the videos are still hard-coded, only they're hard-coded by demonstration, with extremely low tolerance for variation in tasks or environments, and they still can't do anything they haven't been painstakingly and explicitly shown how to do. This is a sever limitation and without a clear solution to it there's no autonomy.

On the other hand, ιδού πεδίον δόξης λαμπρόν, as we say in Greek. This is a wide open field full of hills to plant one's flag on. There's so much that robotic autonomy can't yet do that you can get Google to fund you if you can show a robot tying half a knot.

__________________

[1] Note btw that straightening the shirt is pointless: it will straighten up when you hang it. That's just to show the robot can do some random moves and arrive at a result that maybe looks meaningful to a human, but there's no way to tell whether the robot is sensing that it achieved a goal, or not. The straightening part is just a gimmick.

We're building software for neuromorphic cameras specifically for robotics. If robots could actually understand motion in completely unconstrained situations, then both optimal control and modern ML techniques would easily see uplift in capability (i.e. things work great in simulation, but you can't get good positions and velocities accurately and at high enough rate in the real world). Robots already have fast, accurate motors, but their vision systems are like seeing the world through a strobe light.
This is not preprogrammed replay. Replay would not be able handle even tiny variations in the starting positions of the shirt.
So, a couple things here.

It is true that replay in the world frame will not handle initial position changes for the shirt. But if the commands are in the frame of the end-effector and the data is object-centric, replay will somewhat generalize.(Please also consider the fact that you are watching the videos that have survived the "should I upload this?" filter.)

The second thing is that large-scale behavior cloning (which is the technique used here), is essentially replay with a little smoothing. Not bad inherently, but just a fact.

My point is that there was an academic contribution made back when the first aloha paper came out and they showed doing BC on low-quality hardware could work, but this is like the 4th paper in a row of sort of the same stuff.

Since this is YC, I'll add - As an academic (physics) turned investor, I would like to see more focus on systems engineering and first-principles thinking. Less PR for the sake of PR. I love robotics and really want to see this stuff take off, but for the right reasons.

> large-scale behavior cloning (which is the technique used here), is essentially replay with a little smoothing

A definition of "replay" that involves extensive correction based on perception in the loop is really stretching it. But let me take your argument at face value. This is essentially the same argument that people use to dismiss GPT-4 as "just" a stochastic parrot. Two things about this:

One, like GPT-4, replay with generalization based on perception can be exceedingly useful by itself, far more so than strict replay, even if the generalization is limited.

Two, obviously this doesn't generalize as much as GPT-4. But the reason is that it doesn't have enough training data. With GPT-4 scale training data it would generalize amazingly well and be super useful. Collecting human demonstrations may not get us to GPT-4 scale, but it will be enough to bootstrap a robot useful enough to be deployed in the field. Once there is a commercially successful dextrous robot in the field we will be able to collect orders of magnitude more data, unsupervised data collection should start to work, and robotics will fall to the bitter lesson just as vision, ASR, TTS, translation, and NLP before.

"Limited generalisation" in the real world means you're dead in the water. Like the Greek philosopher Heraclitus pointed out 2000+ years go, the real world is never the same environment and any task you want to carry out is not the same task the second time you attempt it (I'm paraphrasing). The systems in the videos can't deal with that. They work very similar to industrial robots: everything has to be placed just so with only centimeters of tolerance in the initial placement of objects, and tiny variations in the initial setup throw the system out of whack. As the OP points out, you're only seeing the successful attempts in carefully selected videos.

That's not something that you can solve with learning from data, alone. A real-world autonomous system must be able to deal with situations that it has no experience with, it has to be able to deal with them as they unfold, and it has to learn from them general strategies that it can apply to more novel situations. That is a problem that, by definition, cannot be solved by any approach that must be trained offline on many examples of specific situations.

Thank you for your rebuttal. It is good to think about the "just a stochastic parrot" thing. In many ways this is true, but it might not be bad. I'm not against replay. I'm just pointing out that I would not start with an _affordable_ 20k robot with fairly undeveloped engineering fundamentals. It's kind of like trying to dig a foundation to your house with a plastic beach shovel. Could you do it? Maybe, if you tried hard enough. Is it the best bet for success? doubtful.
The detail about end-effector frame is pretty critical as doing this BC with joint angles would not be tractable. You can tell there was a big shift from the RL approaches trying to do very generalizing algorithms to more recent works that are heavily focused on this arms/manipulators because end-effector control enables more flashy results.

Another limiting factor is that data collection is a big problem: not only will you never be sure you've collected enough data, they're collecting data of a human trying to do this work through a janky teleoperation rig. The behavior they're trying to clone is of a human working poorly, which isn't a great source of data! Furthermore limiting the data collection to (typically) 10Hz means that the scene will always have to be quasi-static, and I'm not sure these huge models will speed up enough to actually understand velocity as a 'sufficient statistic' of the underlying dynamics.

Ultimately, it's been frustrating to see so much money dumped into the recent humanoid push using teleop / BC. It's going to hamper the folks actually pursing first-principles thinking.

BC = Behavioural Cloning.

>> It's going to hamper the folks actually pursing first-principles thinking.

Nah.

What's your preferred approach?
What do you mean by saying that they're replaying signals from teleoperation demonstrations? Like in https://twitter.com/DannyDriess/status/1780270239185588732, was someone demonstrating how to struggle to fold a shirt, then they put a shirt in the same orientation and had the robot repeat the same motor commands?
Yeah, it's called imitation learning. Check out the Mobile ALOHA paper where they explain their approach in some detail:

https://arxiv.org/abs/2401.02117