
In addition to the papers on end-to-end learning for robotics, it might also be worth reading about the state of the art in classical robotics. There's a lot of debate in the field about whether end-to-end learning and scaling will solve robotics[1]. On the E2E side, there's the bitter lesson, scaling for LLMs, and other AI success cases. On the skeptical side, there's the reliability limit (has anyone seen any ML cross the 1-failure-in-100,000 barrier on real data?) and the bitter-er lesson (scaling on search can be better than scaling on data, and classical robotics is scaling search instead of data). Data availability is a blocker for research, but in production many use cases are profitable with teleop, so data can be collected profitably, especially with UX design to make teleop more efficient.
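
To put a rough number on the reliability point, here is a back-of-the-envelope sketch of how many failure-free real-world trials it takes just to claim a failure rate below some target. It assumes independent trials and zero observed failures (a plain binomial bound); the function name and the 95% confidence default are my own choices, not anything from the linked article.

```python
import math

def trials_needed(max_failure_rate: float, confidence: float = 0.95) -> int:
    """Failure-free trials needed to claim the true failure rate is below
    max_failure_rate at the given confidence (assumes independent trials)."""
    # P(zero failures in n trials) = (1 - p)^n; we need this <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_failure_rate))

# Hypothetical target from the comment above: 1 failure in 100,000.
n = trials_needed(1e-5)
print(n)
```

For a 1-in-100,000 target this comes out to roughly 300,000 consecutive successes (the usual "rule of three", n ≈ 3/p), which is part of why that barrier is so hard to demonstrate on real data.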

Navigating in stair-free commercial environments was solved in mid-2009 by classical planning + SLAM with LIDAR, and open-sourced in the ROS navstack. A LIDAR-free version using stereo cameras was also open-sourced shortly thereafter. The navstack is still maintained and integrated by Open Robotics[2] and Opennav[3]. These techniques (and in many cases forks of the OSS code) power e.g. 10,000 bear.ai robots in restaurants today, as well as some of the newer Roombas. All of this is CPU-only, and can run on a NUC.
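
For a feel of what "classical planning" means here, below is a minimal sketch of A* search over a 2D occupancy grid of the kind SLAM produces. This is an illustration only, not the navstack's code: the grid, the 4-connected moves, and the unit step cost are all simplifying assumptions of mine.

```python
import heapq

def astar(grid, start, goal):
    """4-connected A* on an occupancy grid (0 = free, 1 = occupied).
    Returns a list of (row, col) cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for 4-connected unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if g > cost[cur]:
            continue  # stale heap entry, a cheaper route was already found
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < cost.get(nxt, float("inf")):
                    cost[nxt] = new_g
                    came_from[nxt] = cur
                    heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None

# Toy map: a wall with a single gap on the right.
grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
```

Search like this is cheap enough that CPU-only operation is unsurprising; a production stack layers costmaps, local planners, and recovery behaviors on top.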

Classical planning has also solved arm navigation quite well. The modern technology here is MoveIt! 2[4]. MoveIt! essentially uses the CAD model of the arm (which most robot manufacturers provide in the correct format) plus sensor data about objects in the environment to plan motions. There are modules to create smoother, human-like motions as well. All of this runs efficiently on CPU alone.
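
The core primitive behind this kind of planning is checking a kinematic model against sensed obstacles. Here is a toy sketch of that idea for a planar arm; it does not use MoveIt's API. The line-segment links, circular obstacles, the sampling density, and the `margin` parameter are all hypothetical simplifications standing in for the CAD model and sensor data.

```python
import math

def link_points(joint_angles, link_lengths, samples_per_link=10):
    """Forward kinematics for a planar serial arm: sample points along each link."""
    x, y, theta = 0.0, 0.0, 0.0
    points = [(x, y)]
    for q, length in zip(joint_angles, link_lengths):
        theta += q  # joint angles are relative to the previous link
        for i in range(1, samples_per_link + 1):
            s = length * i / samples_per_link
            points.append((x + s * math.cos(theta), y + s * math.sin(theta)))
        x += length * math.cos(theta)
        y += length * math.sin(theta)
    return points

def in_collision(joint_angles, link_lengths, obstacles, margin=0.02):
    """obstacles is a list of (cx, cy, radius) circles, e.g. from a sensor."""
    for px, py in link_points(joint_angles, link_lengths):
        for cx, cy, radius in obstacles:
            if math.hypot(px - cx, py - cy) <= radius + margin:
                return True
    return False

def joint_space_line_is_free(q_start, q_goal, link_lengths, obstacles, steps=50):
    """Check a straight-line joint-space motion by testing interpolated poses."""
    for k in range(steps + 1):
        t = k / steps
        q = [a + t * (b - a) for a, b in zip(q_start, q_goal)]
        if in_collision(q, link_lengths, obstacles):
            return False
    return True

links = [1.0, 1.0]
obstacles = [(2.0, 0.0, 0.1)]  # a small object at full reach along +x
```

A sampling-based planner would call a check like `joint_space_line_is_free` many thousands of times while growing a tree or roadmap, which is the "scaling search" side of the argument.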

Lastly, LIDAR-less SLAM and mapping are also starting to appear: https://docs.luxonis.com/software/ros/vio-slam. LIDAR costs have also fallen to the point where robot vacuums are sold with integrated LIDARs.

The main areas where classical has not made as much progress are soft objects (e.g. folding towels) and object detection. Classical point-cloud-based object detection, for example, is based on correspondence grouping[5], but overall everyone is using neural nets, at least in part, for these problems.
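
As a rough illustration of the correspondence-grouping idea: given candidate matches between model keypoints and scene keypoints, keep the subset whose pairwise distances agree in both clouds, since a rigid object preserves distances. This is a greedy geometric-consistency sketch of my own, not PCL's implementation; the tolerance and the toy points are made up.

```python
import math

def consistent_group(model_pts, scene_pts, correspondences, tol=0.01):
    """Greedy geometric-consistency grouping.
    correspondences: list of (model_index, scene_index) candidate matches.
    Returns the largest group found in which every pair of matches preserves
    the model-to-model distance in the scene (within tol)."""
    best = []
    for seed in correspondences:
        group = [seed]
        for cand in correspondences:
            if cand == seed:
                continue
            if all(
                abs(math.dist(model_pts[cand[0]], model_pts[m])
                    - math.dist(scene_pts[cand[1]], scene_pts[s])) <= tol
                for m, s in group
            ):
                group.append(cand)
        if len(group) > len(best):
            best = group
    return best

# Toy data: the scene holds the model shifted by (5, 5, 5) plus one clutter point.
model = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
scene = [(5, 5, 5), (6, 5, 5), (5, 6, 5), (5, 5, 6), (9, 9, 9)]
matches = [(0, 0), (1, 1), (2, 2), (3, 3), (0, 4)]  # last match is wrong
group = consistent_group(model, scene, matches)
```

The surviving group is what you would hand to a pose estimator. The hard part in practice is getting good candidate matches in the first place, which is where learned features tend to come in.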

As for end-to-end in prod without human-in-the-loop, Covariant and Ambi are the only cases I've seen so far. They benefit from being able to use a classical safety layer and a classical success detector via e.g. object weights (I'm not sure what approach they are using; I've just seen object weight used elsewhere). With that they can get the much-desired data flywheel effect of self-improving systems.
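
To make the object-weight idea concrete, here is a hypothetical sketch of a weight-based success detector that auto-labels pick attempts. As noted above, I don't know what these companies actually do; the scale readings, the tolerance, and every name below are assumptions for illustration.

```python
def pick_succeeded(weight_before_g, weight_after_g, expected_item_g, tolerance_g=5.0):
    """Classical success check using a scale under the source bin:
    a good pick removes roughly one item's worth of weight."""
    removed_g = weight_before_g - weight_after_g
    return abs(removed_g - expected_item_g) <= tolerance_g

def label_attempts(attempts, expected_item_g):
    """Attach a success label to each logged attempt without a human in the loop.
    Each attempt is a dict with 'before_g' and 'after_g' scale readings."""
    return [
        {**a, "success": pick_succeeded(a["before_g"], a["after_g"], expected_item_g)}
        for a in attempts
    ]

log = [
    {"before_g": 1000.0, "after_g": 880.0},   # one item removed
    {"before_g": 880.0, "after_g": 880.0},    # missed grasp
    {"before_g": 880.0, "after_g": 640.0},    # double pick
]
labeled = label_attempts(log, expected_item_g=120.0)
```

Labels like these are what feed the flywheel: every production attempt becomes a training example with a reasonably trustworthy outcome attached.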

1. https://spectrum.ieee.org/solve-robotics
2. https://openrobotics.org
3. https://opennav.org
4. https://moveit.picknik.ai/humble/index.html
5. https://pcl.readthedocs.io/projects/tutorials/en/latest/corr...