| I took down an assembly plant by clicking on a Network status icon from a particular hardware supplier. Over the weekend, firmware patches were applied, and the server rebooted. After reboot, everything worked fine, so the tech marked the change successful and went home. Well, apparently the NICs would work just fine, but not all settings were applied until you opened the UI provided by the vendor. When you opened the UI, the final settings would be applied, and the NICs would reboot, just long enough to kill TCP connections. That loss of TCP connection killed the parent system, and then all the other children systems also died when the parent died. So who would you even blame there? The guy who set the tripwire? The guy who tripped on the tripwire? The guy who designed a system that could be brought down by a momentary loss of connection? I'm lucky that my boss wasn't the type to point fingers, because I was the guy who was there when it happened, and it sure got a lot of attention. |
The UI part suggests that it was Windows, and if it was, "just long enough to kill TCP connections" isn't quite the right explanation, since a typical TCP session needs quite a lot of downtime before it times out on its own.
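To put a rough number on "quite a lot of downtime", here is a minimal sketch that adds up the retransmission timers a sender waits through before giving up. It assumes a plain doubling backoff with a cap. The function name and parameters are made up for illustration, and the figures plugged in are the documented Linux defaults (`tcp_retries2 = 15`, RTO between 200 ms and 120 s), not Windows values, which differ.

```python
def time_until_abort(retries, initial_rto, max_rto):
    """Seconds of silence before a sender abandons an unacknowledged segment.

    Hypothetical helper: assumes the retransmission timeout doubles after
    every attempt and is capped at max_rto. The connection is dropped when
    the timer that follows the last retransmission expires.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(retries + 1):  # original send plus each retransmission
        total += rto
        rto = min(rto * 2, max_rto)
    return total


# Linux defaults: 15 retransmissions, RTO from 0.2 s up to 120 s
print(round(time_until_abort(15, 0.2, 120.0), 1))  # 924.6 seconds, about 15 minutes
```

With numbers like these, an outage of a few seconds would not time a connection out by itself; something has to close it actively.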
In Windows, if a NIC goes down, all the TCP connections that use the NIC get closed immediately. (Or at least this was the case a few years ago. I had a similar system with similar drawbacks deployed back then, though it was an automated warehouse, not an assembly plant.)
> So who would you even blame there?
The idiots who designed the system to run on a non-industrial-grade operating system. Windows was never a good choice for controlling industrial installations.