I'm working on embedded systems and I've seen and heard some horror stories just on the device's side: piles and piles of pre- and post-reboot shell scripts filled with race conditions against the system's services and themselves. When these break, if you're lucky a factory reset is enough to fix the system; if you're unlucky they become field bricks.

I'm trying to buck the trend though, and on the new embedded system I'm working on, I've specifically designed the upgrade system to be as reliable as I can make it. It goes something like this:

- The new firmware is downloaded to the secondary application slot.
- Just prior to rebooting, the entire state data of the system is serialized as a document and stored on a flash partition.
- The upgrade flag is set, the system reboots and MCUboot does its thing.
- The new firmware finds out an upgrade happened, clears out all the data partitions, restores from the document and then clears out the document's partition.

The system is basically sanitized and restored after each upgrade. It's also the same codepath that handles saving and restoring the system's configuration by the end-user as well as settings management. If the document schema is for an older version, the N-to-N+1 schema upgraders are run on it prior to applying, instead of trying to patch the system in-place.

If something goes horribly wrong, flip a jumper to trigger the heavy-duty sanitization that nukes the entire external flash (internal flash only contains the bootloader, primary application slot and factory parameters, so it's essentially read-only once the application boots).

It might be hubris, but I hope it's good enough that I'll never see a bricked card that can't be resurrected by a factory reset with this project (assuming no hardware damage, no internal flash corruption and no bricking firmware getting signed with production keys seeping through the cracks despite all the checks in place).
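To make the N-to-N+1 idea concrete, here is a minimal sketch in Python (the real thing would presumably be C on the device). Everything in it is an assumption for illustration: the `schema` key, the field names, the version numbers and the `wipe_data` / `apply` / `clear_saved` callbacks are hypothetical stand-ins, not the actual firmware's API.

```python
# Hypothetical sketch: chain N-to-N+1 schema upgraders over a saved state
# document, then wipe and restore. All names and fields are illustrative.

CURRENT_SCHEMA = 3


def _v1_to_v2(doc):
    # Example change (assumed): the "name" field was renamed to "hostname".
    doc = dict(doc)
    doc["hostname"] = doc.pop("name", "device")
    doc["schema"] = 2
    return doc


def _v2_to_v3(doc):
    # Example change (assumed): a new setting appears with a safe default.
    doc = dict(doc)
    doc.setdefault("ntp_enabled", False)
    doc["schema"] = 3
    return doc


# Each upgrader only knows how to go from version N to N+1.
UPGRADERS = {1: _v1_to_v2, 2: _v2_to_v3}


def migrate(doc):
    """Bring a saved document up to CURRENT_SCHEMA one step at a time."""
    version = doc["schema"]
    if version > CURRENT_SCHEMA:
        raise ValueError("document is newer than this firmware understands")
    while version < CURRENT_SCHEMA:
        doc = UPGRADERS[version](doc)
        version = doc["schema"]
    return doc


def restore_after_upgrade(saved_doc, wipe_data, apply, clear_saved):
    """Migrate first, then wipe, restore, and drop the saved document."""
    doc = migrate(saved_doc)  # fails before anything has been erased
    wipe_data()               # clear all data partitions
    apply(doc)                # restore state from the migrated document
    clear_saved()             # finally clear the document's own partition
    return doc
```

Migrating before wiping is a choice made in this sketch: a document that cannot be upgraded is rejected while the old data is still intact.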
This sort of functional interdependency has become increasingly common in embedded these days with heterogeneous SoCs.
One thing I've seen before is to separate downloading from rebooting: broadcast the update manifest to all the independent processors so each can check it locally (all updates need a declarative manifest, for so, so many reasons), and only proceed when they all agree. A rollback is initiated if, afterwards, they can't all see each other at the expected versions.
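Roughly, the two checks look like this. It's a sketch only, with the transport left out entirely; the manifest layout (a `versions` map from processor name to expected version) and the function names are assumptions made up for illustration.

```python
# Hypothetical sketch of the agree-then-verify logic. The manifest format
# and names are illustrative; transport between processors is not modelled.


def all_agree(manifest, votes):
    """Proceed only if every processor named in the manifest voted yes.

    votes maps processor name -> True/False from its local manifest check.
    A processor that never answered counts as a "no".
    """
    return all(votes.get(name) is True for name in manifest["versions"])


def needs_rollback(manifest, reported):
    """After reboot, roll back if anyone is missing or on the wrong version.

    reported maps processor name -> version string it announced.
    """
    return any(
        reported.get(name) != version
        for name, version in manifest["versions"].items()
    )


manifest = {"versions": {"app": "2.1.0", "dsp": "1.4.2"}}

# The download is already done; the reboot is gated on unanimous agreement.
ok_to_reboot = all_agree(manifest, {"app": True, "dsp": True})

# After the reboot, one processor never shows up, so a rollback is needed.
rollback = needs_rollback(manifest, {"app": "2.1.0"})
```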
Still isn't perfect either.