Hacker News
by contingencies 528 days ago
As someone who has gone pretty deep into robotics over the last 9 years, I skipped right to the physical AI portion and wasn't impressed.

This has been stated on HN in most robotics threads, but the core of what they show, once again, is content generation, a feature largely looking for an application. The main application discussed is training data synthesis. While there is value in this for very specific use cases, it's still lipstick ("look it works! wow AI!") on a pig (i.e. a non-deterministic system being placed in a critical operations process). This embodies one of the most fallacious, generally unspoken assumptions in AI and robotics today: that it is desirable to deal with the real world in an unstructured manner using fuzzy, vendor-linked, unauditable, shifting-sand AI building blocks. This assumption can make sense for driving and other relatively uncontrolled environments, where immovable infrastructure and vast cultural, capital and paradigm investments demand complex multi-sensor synthesis and rapid decision making from environmental context and prior training, but it makes very little sense for industrial, construction, agricultural, rural and similar settings. Industrial is traditionally all about understanding the problem or breaking it in to unit operations, design, fabricate and control the environment to optimize the process for each of those in sequence, and thus lowering the cost and increasing the throughput.

NVidia further wants us to believe we should buy three products from them: an embedded system ("nano"), a general purpose robotic system ("super") and something more computationally expensive for simulation-type applications ("ultra"). They claim (with apparently no need to proffer evidence whatsoever) that "all robotics" companies need these "three computers". I've got news for you: we don't. This is a fantasy, and limited value add, if any, will result from what amounts to yet another amorphous simulation, integration and modeling platform based on broken vendor assumptions. Ask anyone experienced in industrial; they'll agree. The industrial vendor space is somewhat broken and rife with all sorts of dodgy things that wouldn't fly in other sectors, but NVidia simply ain't gonna fix it with their current take, which for me lands somewhere between wishful thinking and downright duplicitous.

As for "digital twins", most industrial systems are much like software systems: emergent, cobbled together from multiple poor and broken individual implementations, sharing state across disparate models, each based on poorly documented or undocumented design assumptions. This means their view of self-state, or "digital twin", is usually functionally fallacious. Where "digital twins" can truly add value is in areas like functional safety: if you design things correctly, you avoid being mired in potentially lethal emergent states arising from interdependent subsystems, states that were not considered at subsystem design, maintenance or upgrade time because a non-exhaustive, insufficiently formal and insufficiently deterministic approach was used in system design and specification. This very real value, however, hinges on being delivered at design time, before implementation, which means you're not going to be buying 10,000 NVidia chips, but most likely zero.

So my 2c is the Physical AI portion is basically a poorly founded forward-looking application sketch from what amounts to a professional salesman in a shiny black crocodile jacket at a purchased high-viz keynote. Perhaps the other segments had more weight.

2 comments

Upvoted for a real wall-of-text in an AI-generation era.
I wonder if you're missing the forest for the trees though. Google has done all of this already, but in a bespoke manner that is exclusive to Google DeepMind and they have discovered that it works really well.

>Industrial is traditionally all about understanding the problem or breaking it in to unit operations, design, fabricate and control the environment to optimize the process for each of those in sequence, and thus lowering the cost and increasing the throughput.

The thing is, you can do that using SayCan [0] [1]. You give your robot a natural language instruction, and that instruction is broken down into those unit operations. Each unit operation is then performed by a robotics transformer model that predicts the next action to move the mobile base and end-effector in Cartesian space.
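To make the decomposition step concrete, here is a minimal Python sketch of SayCan-style skill selection: each candidate skill gets a language-model relevance score multiplied by an affordance (feasibility) score, and the best-scoring skill is picked at each step until "done" wins. Everything named here (`plan`, `toy_language_score`, `toy_affordance_score`, the skill list) is hypothetical and stands in for the real LLM and learned value functions; it is not Google's code or API, and the skill history is used as a crude stand-in for robot state.

```python
from typing import Callable, List, Sequence

DONE = "done"

# Hypothetical scorer signatures; both return a value in [0, 1].
LanguageScore = Callable[[str, Sequence[str], str], float]   # (instruction, history, skill)
AffordanceScore = Callable[[Sequence[str], str], float]      # (history as state, skill)


def plan(instruction: str,
         skills: Sequence[str],
         language_score: LanguageScore,
         affordance_score: AffordanceScore,
         max_steps: int = 8) -> List[str]:
    """Greedily pick the skill maximizing language score * affordance score."""
    history: List[str] = []
    for _ in range(max_steps):
        best = max(
            skills,
            key=lambda s: language_score(instruction, history, s)
            * affordance_score(history, s),
        )
        if best == DONE:
            break
        history.append(best)  # a real system would execute the skill here
    return history


# Toy stand-ins (assumptions, not real models).
WANTED = ["find sponge", "pick up sponge", "go to table", "put down sponge", DONE]


def toy_language_score(instruction: str, history: Sequence[str], skill: str) -> float:
    # Pretend the LLM favors the next step of a fixed script.
    next_wanted = WANTED[len(history)] if len(history) < len(WANTED) else DONE
    return 0.9 if skill == next_wanted else 0.1


def toy_affordance_score(history: Sequence[str], skill: str) -> float:
    # Pretend the value function knows you cannot pick up what you have not found,
    # and cannot put down what you have not picked up.
    if skill == "pick up sponge" and "find sponge" not in history:
        return 0.05
    if skill == "put down sponge" and "pick up sponge" not in history:
        return 0.05
    return 1.0


SKILLS = ["find sponge", "pick up sponge", "pick up apple",
          "go to table", "put down sponge", DONE]

if __name__ == "__main__":
    print(plan("bring the sponge to the table", SKILLS,
               toy_language_score, toy_affordance_score))
```

The point of the multiplication is that a skill the language model likes but the robot cannot currently do gets suppressed, which is the part that ties the plan to the physical state.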

What Nvidia is doing is just selling you the tools to do it yourself, and one of the key parts is adapting simulation video data back into real world camera data. Robotics transformers work just as well as conventional LLMs in practice, but due to the lack of large scale data, they are limited to what they have seen in their training data, with okay performance on unseen tasks. Now here is the thing though: running a robotics transformer in real time is not feasible so far. Even Google's on-robot hardware was running the control loop at 3 Hz, which isn't fast enough.
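For a rough sense of what 3 Hz means, here is a back-of-the-envelope Python sketch. The 0.5 m/s end-effector speed and the 1 kHz comparison loop are illustrative assumptions of mine, not figures from the thread or from Google.

```python
def control_period_s(rate_hz: float) -> float:
    """Time between consecutive commands for a loop running at rate_hz."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return 1.0 / rate_hz


def travel_between_commands_m(speed_m_per_s: float, rate_hz: float) -> float:
    """Distance covered at constant speed before the next command arrives."""
    return speed_m_per_s * control_period_s(rate_hz)


if __name__ == "__main__":
    assumed_speed = 0.5  # m/s, hypothetical end-effector speed
    for rate in (3.0, 1000.0):  # 3 Hz from the thread; 1 kHz is an assumed comparison
        period_ms = control_period_s(rate) * 1000.0
        travel_mm = travel_between_commands_m(assumed_speed, rate) * 1000.0
        print(f"{rate:g} Hz: {period_ms:.1f} ms per command, {travel_mm:.1f} mm travelled")
```

At 3 Hz the gap between commands is about a third of a second, so under these assumed numbers the arm moves open-loop for a long stretch before the model can correct it.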

[0] https://robotics-transformer1.github.io/img/saycan_rt1_demo3...

[1] https://say-can.github.io/img/demo_sequence_compressed.mp4

https://robotics-transformer1.github.io/

https://say-can.github.io/

>you can do that using...

Yes, of course, TMTOWTDI. Respectfully, you sound like an academic or someone with only software-side experience and less business-side understanding. It's also common with recent grads, who often seem to view robotics through a ROS-influenced lens. Stated succinctly, generalist approaches such as robot arm work cells are usually low throughput, high cost, and have numerous limitations (e.g. arms can't lift heavy or large items very well). While programming time may be sold as short, the reality is that you're building into a high capex business an avoidable dependence on a software stack with a half-life of six months, in a field where people are hard to hire, when you could instead buy an objectively superior solution in terms of throughput, accuracy, spatial efficiency and maintainability without these problems.

Case in point: if you look at the build process for any relatively high complexity modern product (phones, cameras, etc.), they generally avoid using arms because arms are relatively slow and weak, mandate large motion envelopes with OH&S implications, and must be tediously integrated through vendor-linked APIs. There are better building blocks, and they don't involve tokenized fudge-factors. Yes, you could program them this way; no, it generally won't add value. Arms are excellent for things like repeat precision welding (due to the high number of axes of motion offering superior approach and tracking options for target parts and sub-assemblies), but control in such cases comes from a combination of prescriptive part models and real time sensors for physical path following, not from AI fudge.