by csours 3005 days ago
I took down an assembly plant by clicking on a Network status icon from a particular hardware supplier.

Over the weekend, firmware patches were applied, and the server rebooted. After reboot, everything worked fine, so the tech marked the change successful and went home.

Well, apparently the NICs would work just fine, but not all settings were applied until you opened the UI provided by the vendor. When you opened the UI, the final settings would be applied, and the NICs would reboot, just long enough to kill TCP connections.

That loss of TCP connection killed the parent system, and then all the child systems also died when the parent died.

So who would you even blame there? The guy who set the tripwire? The guy who tripped on the tripwire? The guy who designed a system that could be brought down by a momentary loss of connection?

I'm lucky that my boss wasn't the type to point fingers, because I was the guy who was there when it happened, and it sure got a lot of attention.


> [...] not all settings were applied until you opened the UI provided by the vendor. [...] the NICs would reboot, just long enough to kill TCP connections.

The UI part suggests that it was Windows, and if it was, then "just long enough" to kill TCP connections is not quite the right picture: normally you need quite a lot of downtime to terminate a typical TCP session.

In Windows, if a NIC goes down, all the TCP connections that use the NIC get closed immediately. (Or at least this was the case a few years ago. I had a similar system with similar drawbacks deployed back then, though it was an automated warehouse, not an assembly plant.)

> So who would you even blame there?

The idiots who designed the system to run on a non-industrial-grade operating system. Windows was never a good choice to control industrial installations.

Windows is often the only vendor-supported choice for interfacing your computer applications to PLCs and such things. Also, most of the proprietary protocols run over industrial Ethernet are some kind of legacy serial (RS-232, RS-485, ...) bytestream format wrapped in TCP, and the software usually does not handle loss of the TCP connection particularly gracefully. (On multiple occasions I've seen rules like "reboot the whole installation on every shift change" to "handle" the obvious reliability issues of such systems.)
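
For what it's worth, handling the drop on the client side does not have to be complicated. Here is a minimal sketch in Python, assuming a plain TCP byte stream with no vendor framing. Every name in it (`read_frames`, `handle`, `max_attempts`, `base_delay`) is made up for illustration and does not come from any real PLC library. It treats a closed, reset, or timed-out connection as a normal event and reconnects with exponential backoff instead of dying:

```python
import socket
import time


def read_frames(host, port, handle, max_attempts=5, base_delay=0.5,
                timeout=5.0, sleep=time.sleep):
    """Read a TCP byte stream, reconnecting with backoff when the link drops.

    Hypothetical helper: no vendor protocol or real PLC API is implied.
    """
    attempt = 0
    while attempt < max_attempts:
        try:
            with socket.create_connection((host, port), timeout=timeout) as conn:
                attempt = 0  # connected, so reset the failure counter
                while True:
                    data = conn.recv(4096)
                    if not data:
                        # Orderly close by the peer: treat it like any other drop.
                        raise ConnectionError("peer closed the connection")
                    handle(data)
        except OSError:
            # ConnectionError and socket timeouts are both OSError subclasses.
            attempt += 1
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {max_attempts} consecutive failed attempts")
```

Something like `read_frames("192.0.2.10", 5000, process_bytes)` would keep feeding `process_bytes` across short NIC resets and only give up after several consecutive failures. The address, port, and callback there are placeholders, and a real installation would also have to decide what state to resynchronize after each reconnect.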

It is not about some small and well-defined set of "idiots"; it is essentially an industry-wide design mistake.

> Windows is often the only vendor-supported choice for interfacing your computer applications to PLCs and such things.

Which is not a problem by itself, since a PLC, being industrial equipment, should operate independently of non-industrial equipment. The problem is idiots who think a desktop PC can reliably control a PLC in real time.

The problem arises when you have some kind of process that is inherently controlled not by the logic in the PLC but by some external system (either because the required data will not fit into the PLC's data memory or because it changes constantly based on some external business processes).

A reasonable architecture for this kind of problem would be attaching a server to the PLC as a peripheral, but it tends to be done the other way around. As for the reasons, I speculate that it is simply inertia on the part of the typical PLC programmer, compounded by reasoning along the lines of "nobody does that, so it is not tested, and we will hit unknown bugs in the PLC firmware itself."

Is that a reference to Beckhoff?

> In Windows, if a NIC goes down, all the TCP connections that use the NIC get closed immediately.

Yes, that seems more likely.

I think Windows can be a decent platform for light industrial applications - which this system in particular was. The problem is all of the partners and suppliers were either stuck in the past, or had weird ideas.

The parent system was *nix-based, but there was a flaw in a communications protocol that led to the channel bouncing between two boxes, and eventually bringing down the parent system.

My lesson from that was that you can have flaws on any system, no matter how solid the OS.

One view that a lot of your colleagues may have had is that you just made clear to the company how relevant their jobs are (I am assuming most of the systems were built in-house), and that decisions made in the interest of expediency can now be revisited in order to scope out additional work.

Shhhh... stop telling secrets.