
That is a solution, but not the cause. The cause is not having a culture that evaluates failure scenarios. From what I have read:

  * Updates are not vetted or sanity checked.
  * Updates are not slow-rolled to production.
  * Updates are not signed to prevent corruption or alteration.
  * Updater does not sanitize or validate inputs.
  * Updater does not have a reversion process to a previously known-good state on faulty boot.
  * Updater does not mark itself as Unnecessary For Boot after repeated faulty boots, which it should do at some point.
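
To make the list concrete, here is a rough sketch of what those checks could look like in an updater. Everything in it is hypothetical (the `Updater` class, `MAX_FAILED_BOOTS`, the bucket-based rollout), and HMAC-SHA256 stands in for a real asymmetric signature only because it is in the Python standard library. It is not how any actual product does it.

```python
import hashlib
import hmac

MAX_FAILED_BOOTS = 3  # hypothetical threshold before the updater stops being boot-critical


def verify_update(payload: bytes, tag: str, key: bytes) -> bool:
    """Reject payloads whose HMAC-SHA256 tag does not match (stand-in for a signature check)."""
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


def validate_payload(payload: bytes) -> bool:
    """Basic sanity check: the payload is non-empty and not all zero bytes."""
    return len(payload) > 0 and any(payload)


def in_rollout(machine_id: str, percent: int) -> bool:
    """Slow roll: deterministically place a machine in one of 100 buckets."""
    digest = hashlib.sha256(machine_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


class Updater:
    def __init__(self, known_good: bytes):
        self.active = known_good
        self.known_good = known_good
        self.failed_boots = 0
        self.required_for_boot = True

    def apply(self, payload: bytes, tag: str, key: bytes) -> bool:
        """Install an update only if it passes the integrity and sanity checks."""
        if not verify_update(payload, tag, key) or not validate_payload(payload):
            return False
        self.active = payload
        return True

    def record_boot(self, ok: bool) -> None:
        """Promote the active version on a good boot; revert on a faulty one."""
        if ok:
            self.failed_boots = 0
            self.known_good = self.active
            return
        self.failed_boots += 1
        self.active = self.known_good  # revert to the last known-good version
        if self.failed_boots >= MAX_FAILED_BOOTS:
            self.required_for_boot = False  # stop blocking boot after repeated faults
```

A caller would gate `apply` behind `in_rollout(machine_id, percent)` and raise `percent` gradually, so a bad update hits a small slice of machines first.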

Finally, its high adoption means it creates a monoculture. There should be a second, independently built version: one runs on the machine while the other sits in a ready state. If the running one faults, it is disabled and the second takes over. Good ol' NASA-style redundancy.
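
A minimal sketch of that failover idea, assuming the two implementations can be modeled as plain callables. `RedundantPair` and its fields are made-up names for illustration only.

```python
from typing import Any, Callable


class RedundantPair:
    """Run a primary implementation; on a fault, disable it and use the standby."""

    def __init__(self, primary: Callable[[Any], Any], standby: Callable[[Any], Any]):
        self.impls = [primary, standby]
        self.disabled = [False, False]

    def run(self, event: Any) -> Any:
        for i, impl in enumerate(self.impls):
            if self.disabled[i]:
                continue
            try:
                return impl(event)
            except Exception:
                # A fault disables this implementation for all later calls.
                self.disabled[i] = True
        raise RuntimeError("all implementations are disabled")
```

The redundancy only helps if the two versions really are built independently. If they share the same parser or the same bad input handling, both fault on the same event.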