| HN Mirror

Upvoted for a real wall-of-text in an AI-generation era.

I wonder if you're missing the forest for the trees, though. Google has done all of this already, but in a bespoke way that is exclusive to Google DeepMind, and they have found that it works really well.

>Industrial is traditionally all about understanding the problem or breaking it in to unit operations, design, fabricate and control the environment to optimize the process for each of those in sequence, and thus lowering the cost and increasing the throughput.

The thing is, you can do that using SayCan [0] [1]. You give your robot a natural-language instruction, which is broken down into those unit operations. Each unit operation is then performed by a robotics transformer model that predicts the next action for moving the mobile base and end-effector in Cartesian space.
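To make the decomposition step concrete, here is a rough sketch of the selection rule as I understand it: at each step, every candidate skill gets a language-model score (how useful is it for the instruction?) multiplied by an affordance score (how likely is it to succeed right now?), and the best one is executed. Everything below is a toy stand-in: `saycan_plan`, `toy_llm_score`, `toy_affordance`, the skill list and all the numbers are hypothetical, not Google's code or API.

```python
from typing import Callable, List, Sequence

DONE = "done"  # hypothetical terminator skill


def saycan_plan(
    instruction: str,
    skills: Sequence[str],
    llm_score: Callable[[str, List[str], str], float],
    affordance: Callable[[List[str], str], float],
    max_steps: int = 8,
) -> List[str]:
    """Greedily pick the skill maximizing llm_score * affordance at each step."""
    history: List[str] = []
    for _ in range(max_steps):
        best = max(
            skills,
            key=lambda s: llm_score(instruction, history, s) * affordance(history, s),
        )
        if best == DONE:
            break
        history.append(best)  # a real system would execute the skill here
    return history


# Toy stand-ins (assumptions): a real system would query an LLM and
# learned value functions instead of these hard-coded rules.
SKILLS = ["find a sponge", "pick up the sponge", "go to the table", "wipe the table", DONE]


def toy_llm_score(instruction: str, history: List[str], skill: str) -> float:
    # Pretend the language model prefers the skills in a fixed sensible order.
    expected = SKILLS[len(history)] if len(history) < len(SKILLS) else DONE
    return 0.8 if skill == expected else 0.05


def toy_affordance(history: List[str], skill: str) -> float:
    # Pretend some skills are unlikely to succeed before their prerequisite.
    if skill == "pick up the sponge" and "find a sponge" not in history:
        return 0.1
    if skill == "wipe the table" and "pick up the sponge" not in history:
        return 0.1
    return 0.9


plan = saycan_plan("clean up the spill on the table", SKILLS, toy_llm_score, toy_affordance)
print(plan)
```

Each string in `plan` is one "unit operation"; in the real setup each would be handed to the low-level policy rather than just appended to a list.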

What Nvidia is doing is just selling you the tools to do it yourself, and one of the key parts is adapting simulation video data back into real-world camera data. Robotics transformers work just as well as conventional LLMs in practice, but due to the lack of large-scale data they are limited to what they have seen in training, with okay performance on unseen tasks. Here is the thing, though: running a robotics transformer in real time is not feasible so far. Even Google's on-robot hardware was running the control loop at 3 Hz, which isn't fast enough.
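For a feel of what 3 Hz means, some back-of-the-envelope arithmetic. The only number taken from above is the 3 Hz rate; the end-effector speeds are made-up illustrative values, and I'm assuming the last action is simply held until the next one arrives.

```python
def control_period_ms(rate_hz: float) -> float:
    """Time between consecutive action updates, in milliseconds."""
    return 1000.0 / rate_hz


def travel_per_step_m(speed_m_s: float, rate_hz: float) -> float:
    """Distance covered between updates if the last command is held constant."""
    return speed_m_s / rate_hz


RATE_HZ = 3.0  # control-loop rate quoted above

print(f"period: {control_period_ms(RATE_HZ):.0f} ms")
for speed in (0.1, 0.5, 1.0):  # assumed speeds in m/s, for illustration only
    cm = travel_per_step_m(speed, RATE_HZ) * 100
    print(f"{speed:.1f} m/s -> {cm:.1f} cm between action updates")
```

So at 3 Hz the policy only gets to correct course about every third of a second, and the faster the arm moves, the further it travels blind in between.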

[0] https://robotics-transformer1.github.io/img/saycan_rt1_demo3...

[1] https://say-can.github.io/img/demo_sequence_compressed.mp4

https://robotics-transformer1.github.io/

https://say-can.github.io/

>you can do that using...

Yes, of course, TMTOWTDI. Respectfully, you sound like an academic or someone with only software-side experience and less business-side comprehension. It's also common with recent grads, who often seem to view robotics through a ROS-influenced lens. Stated succinctly, generalist approaches such as robot arm work cells are usually low throughput and high cost, and have numerous limitations (e.g. arms can't lift heavy or large items very well). While programming time may be sold as short, the reality is that you're building into a high-capex business an avoidable dependence on a software stack with a half-life of six months, for which people are hard to hire, when you could buy an objectively superior solution in terms of throughput, accuracy, spatial efficiency and maintainability without these problems.

Case in point: if you look at the build process for any relatively high-complexity modern product (phones, cameras, etc.), they generally avoid using arms because arms are relatively slow and weak, mandate large motion envelopes with OH&S implications, and must be tediously integrated through vendor-linked APIs. There are better building blocks, and they don't involve tokenized fudge-factors. Yes, you could program them this way; no, it generally won't add value. Arms are excellent for things like repeat precision welding (the high number of axes of motion offers superior approach and tracking options for target parts and sub-assemblies), but control in such cases comes from a combination of prescriptive part models and real-time sensors for physical path following, not from AI fudge.