
by snovv_crash 1163 days ago

It really depends on the degree of granularity. ROS encourages the use of the actor model multiple times inside of the same machine. This is complete overkill, and actually reduces reliability and safety.

For example, how do you write unit tests for an actor-model system? Without unit tests, how do you properly characterize the code's behaviour? When I last did ROS work, I built the whole thing outside of ROS, validated that it worked with tests, and then put some small ROS wrappers on top, and it basically worked first time. But this isn't how ROS-native systems are developed. Instead, people use Gazebo/Rviz to tweak and add things, and you end up with a system that is grown organically, at the single algorithm level, with all the problems that entails.
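A minimal Python sketch of the "build it outside ROS, then wrap it" approach described above. Everything here is hypothetical and for illustration only: `safe_speed`, `Obstacle`, `SpeedLimitNode`, the dict message layout, and the numeric limits are assumptions, and no real ROS API is used. The point is just that the core logic is a pure function you can test in milliseconds, and the messaging layer is a thin adapter around it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obstacle:
    distance_m: float  # range to the obstacle in metres


def safe_speed(obstacle: Obstacle, max_speed: float = 5.0, stop_distance_m: float = 2.0) -> float:
    """Pure function: speed limit that reaches zero at stop_distance_m."""
    if obstacle.distance_m <= stop_distance_m:
        return 0.0
    # Ramp up linearly, 1 m/s per metre beyond the stop distance, capped at max_speed.
    return min(max_speed, obstacle.distance_m - stop_distance_m)


class SpeedLimitNode:
    """Hypothetical thin wrapper: the only part that knows about messaging."""

    def __init__(self, publish):
        self._publish = publish  # callable injected by the middleware glue

    def on_range_message(self, msg: dict) -> None:
        self._publish({"speed_limit": safe_speed(Obstacle(msg["range"]))})


# Unit tests for the core need no middleware, simulator, or threads.
assert safe_speed(Obstacle(1.0)) == 0.0
assert safe_speed(Obstacle(4.0)) == 2.0
assert safe_speed(Obstacle(100.0)) == 5.0

# The wrapper can be tested with a plain list standing in for the publisher.
sent = []
SpeedLimitNode(sent.append).on_range_message({"range": 4.0})
assert sent == [{"speed_limit": 2.0}]
```

Because the wrapper takes its publisher as an argument, the same class could in principle sit behind any transport; only the glue code would change.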

As I posted cross-thread, in the actor model, with queues and threads, you inherently encode additional state via the temporal spacing of the messages. Trying to predict what all of these could be so that you can test for edge cases and make sure things are safe is basically impossible. The modularity of ROS lets you set up a giant system pretty quickly, but ironing out the edge cases takes about as much time as just rewriting the whole thing as a monolith, because you haven't actually been able to test the system properly and the long tail of hidden state and bugs is impossible to avoid, and also impossible to predict and test for.

From what I've seen of the ROS community, the concept of testing is severely lacking. It usually entails running simulations in lots of different scenarios, which in a testing hierarchy is only really your final integration tests. It doesn't tell you about degradations in various subsystems, e.g. control or navigational inefficiencies. It doesn't tell you about regressions based on earlier behaviour. It isn't deterministic, so you get random failures, reducing trust in the testing infrastructure. It takes tons of compute, so your devs wait hours for something they should be able to know in seconds. And because it's slow, devs won't add tests at the same granularity they would otherwise.

In a high-reliability environment, deterministic code is really important. The actor model doesn't give you that: you lose determinism each and every time you cross its interface. It also makes abstractions for granular testing much more difficult. It isn't a silver bullet, and ROS leans so heavily on it that all of the downsides are effectively unmitigated and impossible to avoid.

It sounds like we're working in a similar space. For me it is drone obstacle avoidance and navigation systems, and I found ROS to be entirely unsuitable for anything more granular than inter-drone coordination.


ModernMech 1162 days ago

> For example, how do you write unit tests for an actor-model system?

In an actor model, the units would be the actors. Test that they are deterministic and behave correctly given a message. You can test them for robustness by fuzzing messages and throwing them at the actor. Then you use integration tests to test the whole system's performance.
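As a rough sketch of what that could look like in Python: the actor below (`AltitudeHoldActor`), its message format, and the 0 to 120 m limit are all made up for illustration, and the actor is modelled as a plain object with a synchronous `handle` method rather than anything from a real actor framework. It shows three checks: a known message gives a known reply, the same message sequence gives the same replies, and seeded random garbage never raises or breaks an invariant.

```python
import random


class AltitudeHoldActor:
    """Hypothetical actor: state changes only inside handle()."""

    def __init__(self):
        self.target_m = 0.0

    def handle(self, msg: dict) -> dict:
        if msg.get("type") == "set_target":
            value = msg.get("value")
            # Reject malformed or out-of-range targets instead of raising.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number and 0.0 <= value <= 120.0:
                self.target_m = float(value)
                return {"type": "ack", "target": self.target_m}
        return {"type": "rejected"}


def run(messages):
    """Feed a message sequence to a fresh actor and collect the replies."""
    actor = AltitudeHoldActor()
    return [actor.handle(m) for m in messages]


# Unit test: a known message gives a known reply.
assert run([{"type": "set_target", "value": 10}]) == [{"type": "ack", "target": 10.0}]

# Fuzz input: random, partly malformed messages from a seeded generator,
# so any failure can be reproduced exactly.
rng = random.Random(0)
values = [None, "high", -5, 50, 1e9, float("nan"), True]
fuzz = [
    {"type": rng.choice(["set_target", "land", 7]), "value": rng.choice(values)}
    for _ in range(1000)
]

# Determinism: the same sequence gives the same replies on a fresh actor.
assert run(fuzz) == run(fuzz)

# Robustness: no message may raise or push the target outside its limits.
actor = AltitudeHoldActor()
for m in fuzz:
    actor.handle(m)
    assert 0.0 <= actor.target_m <= 120.0
```

This only covers a single actor in isolation; it says nothing about timing between actors, which is the part being argued about in the rest of the thread.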

> But this isn't how ROS-native systems are developed

Note that I haven't been arguing for ROS, but for loosely coupled architectures for distributed systems like robots. I agree that ROS has many shortcomings. Although I would say this is not a shortcoming of ROS, but of ROS developers. Maybe ROS can be blamed for guiding people to work in such a way.

> As I posted cross-thread, in the actor model, with queues and threads, you inherently encode additional state via the temporal spacing of the messages.

Systems other than ROS do it better, but the point I've been trying to get across is that the actor model is great for distributed systems because it makes explicit the inextricable asynchronous, distributed nature of the system. As I've been arguing, you need to pass messages at some point if you want the robot to be a robot -- it has to interact with the world and society at some level, likely many levels. Your obstacle avoiding drone I assume is communicating with a base station, maybe remote compute, and a remote human operator. If we want to properly test this kind of system, we're going to have to make explicit the fact that the network is not reliable, latency is not zero, etc.

In this light, temporal spacing of messages, rather than being an encumbrance, becomes a necessity. It's a means to test and ensure that the system can handle all sorts of timings and orders of messages, just as it would need to do in the real world. By designing and conducting our tests to incorporate this, we can effectively simulate and anticipate the conditions our system will face.
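One way to make "all sorts of timings and orders" concrete in a test is to enumerate the delivery orders of a small message set and check a safety invariant after every delivery. The Python sketch below assumes a made-up `ArmingActor` with string messages; it is an illustration of the idea, not a claim about how any particular framework does it.

```python
from itertools import permutations


class ArmingActor:
    """Hypothetical actor: motors may arm only after both checks have passed."""

    def __init__(self):
        self.gps_ok = False
        self.battery_ok = False
        self.armed = False

    def handle(self, msg: str) -> None:
        if msg == "gps_ok":
            self.gps_ok = True
        elif msg == "battery_ok":
            self.battery_ok = True
        elif msg == "arm":
            # An early "arm" request is ignored, not queued.
            self.armed = self.gps_ok and self.battery_ok


def final_state(order):
    """Deliver messages in the given order and return whether the actor armed."""
    actor = ArmingActor()
    for msg in order:
        actor.handle(msg)
        # Safety invariant checked after every delivery, in every ordering.
        assert not actor.armed or (actor.gps_ok and actor.battery_ok)
    return actor.armed


messages = ["gps_ok", "battery_ok", "arm"]
outcomes = {order: final_state(order) for order in permutations(messages)}

# Only the orderings where "arm" arrives last actually arm the motors.
armed_orders = sorted(order for order, armed in outcomes.items() if armed)
assert armed_orders == [("battery_ok", "gps_ok", "arm"), ("gps_ok", "battery_ok", "arm")]
```

The example also shows the other side of the argument: the same three messages give different end states depending on arrival order, and the number of orderings grows factorially, so exhaustive enumeration like this only works for small message sets.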

Also, time-deterministic messaging protocols can be used to better manage this temporal aspect.

> you haven't actually been able to test the system properly and the long tail of hidden state and bugs is impossible to avoid, and also impossible to predict and test for.

But does the monolith avoid the edge cases or does it just fall for the fallacies of distributed computing?

> From what I've seen of the ROS community, the concept of testing is severely lacking.

Again, this seems like a shortcoming of the ROS community, and not the actor model.


snovv_crash 1162 days ago

For the drone example, the actor model works fine because each subsystem is safe. However, if you have multiple components on the drone and want them to be managed by an actor model, as ROS would encourage, you introduce a world of uncertainty on an individual subsystem since that subsystem isn't actually autonomous on its own. Having more actors than the underlying physics of the problem strictly requires is a huge issue.

> In this light, temporal spacing of messages, rather than being an encumbrance, becomes a necessity.

And this is the crux of where we disagree. This is a messy part of reality which should be, as far as possible, abstracted away from the algorithms which need to operate on the data presented to them. If I'm running a Kalman filter I don't want to have to design my filter around frequent gyroscope dropouts because image captures are happening. I want my system to have guaranteed behaviour that this won't happen. The actor model makes this harder by not giving me a way to have explicit guarantees; in fact it moves in the opposite direction by embracing flexibility.

While in general I agree that different components should be independently operable, as a system they will more than likely, in the real world, share various resources and you will need to deal with contention.

Any system which drastically increases overheads via serialisation, context switches, possibly network traffic and finally deserialisation in the place of a function call of a few instructions is a design which should be used very sparingly.

The actor model makes testing harder, and this results (again in the real world) in testing less. It also makes system-level tests nondeterministic. Using time-deterministic protocols in place of function calls is just a nonstarter IMO. It's giving up control margin, increasing system load, and doesn't leave you any better off with regard to system stability in case of failure.

Yes, the actor model has its place, but at a very large granularity. Overuse, as in ROS, leads to horrible design constraints, opaque dependencies, difficult or impossible testing, and frankly impossible debugging.

Since you seem to be an actor model evangelist, how would you go about, just as an example, tracing execution flow in a debugger? The data that gets passed into the actor interface is basically runtime-defined GOTOs. Similarly, how would you prove (from a certification perspective) that in certain scenarios the system as a whole behaves in a certain way, and fails in a safe way? Each subsystem can be proved to be safe, but the moment it goes through an async interface all bets are off.


dagar 1162 days ago

> And this is the crux of where we disagree. This is a messy part of reality which should be, as far as possible, abstracted away from the algorithms which need to operate on the data presented to them. If I'm running a Kalman filter I don't want to have to design my filter around frequent gyroscope dropouts because image captures are happening. I want my system to have guaranteed behaviour that this won't happen.

Annoyingly, in a lot of real-world setups you can't have these guarantees. Your gyroscope, camera, etc. are all producing data asynchronously, often with slightly different clocks, and they all have different little idiosyncrasies and failure modes.

For example, the Mars helicopter almost crashed because it missed a single frame. https://mars.nasa.gov/technology/helicopter/status/305/survi... If possible you absolutely want to fix the frame drop in the first place, but your algorithm should also be able to handle the dropout (or at least reset/recover).
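To illustrate "handle the dropout" in the simplest possible setting, here is a hedged Python sketch of a 1-D Kalman filter that skips the measurement update when a sample is missing. The function name `kalman_step`, the constant-state model, and the noise values `q` and `r` are assumptions chosen for illustration; this is not the filter used on the helicopter or in any real flight stack.

```python
def kalman_step(x, p, z, q=0.01, r=0.25):
    """One step of a 1-D constant-state Kalman filter.

    x, p: state estimate and its variance; z: measurement, or None if dropped.
    q, r: process and measurement noise variances (illustrative values).
    """
    # Predict: the state model is "unchanged", so only the uncertainty grows.
    p = p + q
    if z is None:
        return x, p  # dropout: skip the update, keep the prediction
    # Update: blend prediction and measurement by the Kalman gain.
    k = p / (p + r)
    return x + k * (z - x), (1.0 - k) * p


x, p = 0.0, 1.0
variances = []
for z in [1.0, 1.1, None, None, 0.9, 1.0]:
    x, p = kalman_step(x, p, z)
    variances.append(p)

# Uncertainty rises across the two dropped samples and falls once data returns.
assert variances[1] < variances[2] < variances[3]
assert variances[4] < variances[3]
```

The growing variance during a dropout is also a natural signal for the "reset/recover" path: a system could, for example, trigger a fallback once `p` exceeds some threshold, though where to set that threshold is application-specific.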


snovv_crash 1161 days ago

Yes, the system should be able to handle single frame drops, but they shouldn't be expected to happen either. Both sides should be safe.


ModernMech 1162 days ago

> "The data that gets passed into the actor interface is basically runtime-defined GOTOs." ... "Any system which drastically increases overheads via serialisation, context switches, possibly network traffic and finally deserialisation in the place of a function call of a few instructions is a design which should be used very sparingly."

I think that your opinion of the actor model has been particularly colored by ROS, as these constraints aren't necessarily part of the actor model. It's an abstraction, and a formalism that is built around the idea of message passing, but that doesn't mean the actual implementation has to involve literal message passing. If a function call will really do the trick, a sufficiently smart compiler can produce equivalent code.

But the question is... is the function call really synchronous? For instance, you give the example of a gyroscope attached to a Kalman filter, but what about a GPS? What happens when the GPS becomes unavailable, and the Kalman filter doesn't get any more updates? Indeed many (a majority actually) of my sensors have Ethernet interfaces, and we communicate with the sensors over networks that include routers. Some of the robot's sensors are external to the robot itself, and we communicate with them over a wireless network. So when you say this:

> Each subsystem can be proved to be safe, but the moment it goes through an async interface all bets are off.

I find myself in full agreement! But you cross that async interface as soon as you want data from your sensors, because the sensor interface is asynchronous. So you might as well deal with the asynchrony explicitly.

> Since you seem to be an actor model evangelist, how would you go about, just as an example, tracing execution flow in a debugger?

Typically what I look at are message traces. What's nice about the actor model is that it lends itself to new ways of debugging, like time travel debugging. It's also a formalism, so we can leverage that formalism to prove properties of the program.
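A small Python sketch of the message-trace idea, under the assumption that the actor's state depends only on the messages it has received. The `Counter` actor, the `Recorder` wrapper, and `state_at` are hypothetical names for illustration, not part of any debugger or actor library: the recorder logs each message before delivery, and replaying a prefix of the log on a fresh actor reconstructs the state at that point in time.

```python
class Counter:
    """Hypothetical actor whose state depends only on the messages it received."""

    def __init__(self):
        self.total = 0

    def handle(self, msg: dict) -> None:
        if msg["op"] == "add":
            self.total += msg["n"]
        elif msg["op"] == "reset":
            self.total = 0


class Recorder:
    """Wraps an actor and logs every message before delivering it."""

    def __init__(self, actor):
        self.actor = actor
        self.trace = []

    def send(self, msg: dict) -> None:
        self.trace.append(msg)
        self.actor.handle(msg)


def state_at(trace, step):
    """Rebuild the state after the first `step` messages by replaying the trace."""
    replica = Counter()
    for msg in trace[:step]:
        replica.handle(msg)
    return replica.total


live = Recorder(Counter())
for msg in [{"op": "add", "n": 5}, {"op": "add", "n": 2}, {"op": "reset"}, {"op": "add", "n": 1}]:
    live.send(msg)

assert live.actor.total == 1
# "Travel back" to just before the reset to inspect what the actor had accumulated.
assert state_at(live.trace, 2) == 7
assert [state_at(live.trace, i) for i in range(5)] == [0, 5, 7, 0, 1]
```

Replay like this only reproduces the original run if the actor is deterministic and every input arrives as a message; anything read from outside (clocks, random numbers, shared memory) would have to be captured in the trace as well.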

> how would you prove (from a certification perspective) that in certain scenarios the system as a whole behaves in a certain way, and fails in a safe way?

I guess it would depend on what system you're trying to certify and to what standard. If you have something in mind, how would you imagine going about it ideally? Then maybe I can try to respond as to how my mind would wrap around it.


snovv_crash 1161 days ago

Can you give another popular implementation of the actor model as a counterexample? Happy to learn, but I care more about what can practically be done for actual industrial use cases than what can be done on paper or behind closed doors somewhere with critical details for 3rd parties missing.


ModernMech 1160 days ago

In terms of general actor model implementations, there’s the BEAM VM, Akka for Java, and even Rust’s Tokio async system is backed by the actor model.

Your point is taken about industrial robotics lacking proper tools here: it’s often the case that industry is about 10-15 years behind research in the field of robotics. But ROS is already being used in industry, and I imagine in 15 years industry will enjoy improved tooling currently being used in research labs to support better testing and reliability in robotic systems.


snovv_crash 1151 days ago

The people I've worked with in the drone space who actually deliver working products won't touch ROS with a 10-foot barge pole. There are plenty of people using 10x the resources to deliver 10% of the product who are embracing ROS though, and 90% of the work is getting all the different ROS components to play nicely with each other at the same time. Never again.
