by marcolinux
3818 days ago
From the article:
"...a black box monitoring of the system is not sufficient. Black box monitoring, including external monitors, canaries, and so on, only tell the system which side of an externally visible failure boundary a system is on. Many kinds of systems, including nearly every kind that includes some redundancy, can move towards this boundary through multiple failures without crossing it. Black-box monitoring misses these internal state transitions. Catching them can significantly improve the actual, real-world, durability and availability of a system."
The author uses RAID, but this observation is valid for systems in general. I really wish there were some kind of guidance, manual, or set of best practices available on how to design and/or auto-regulate those internal state transitions.
|
The naive approach would be linear: treat two half-failed drives in a mirror as equivalent to one wholly failed drive. That wouldn't work too well, though, mostly because a "half-failed" drive is actually already nearly failed.
This seems like the kind of thing that could use a "calibration curve": you observe how the reported health actually correlates with the remaining MTBF, and then adjust future reported-health estimates by that curve to get the drive's actual health.
I'm guessing the calibration curve will just, itself, end up being a bathtub curve, which means that e.g. a drive with one or two SMART errors would need to be considered "already on its way out." But sometimes such drives live a long time, and as a matter of cost it's probably too expensive to throw out every disk that will probabilistically fail soon. It might be possible, though, to move them to some sort of "non-front-line" service instead, such as Dynamo-like n=17 highly-redundant storage. (I wonder if AWS actually "recycles" EBS volumes into Dynamo/S3 volumes this way.)
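To make the calibration-curve idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names, the idea of normalizing reported health to a 0..1 score, and the binning-plus-linear-interpolation approach are mine, not something from the article or from any SMART tooling. You would feed it (reported health, observed remaining days) pairs collected from drives that have already died.

```python
from bisect import bisect_left
from statistics import mean


def build_calibration_curve(observations, bins=5):
    """Build a curve from (reported_health, observed_remaining_days) pairs.

    reported_health is assumed to be normalized to the range 0..1
    (a hypothetical score, not a real SMART attribute).
    Returns a sorted list of (bin_center, mean_remaining_days) points,
    skipping bins that received no observations.
    """
    buckets = [[] for _ in range(bins)]
    for health, remaining in observations:
        # Clamp so that health == 1.0 lands in the last bin.
        idx = min(int(health * bins), bins - 1)
        buckets[idx].append(remaining)
    return [((i + 0.5) / bins, mean(b)) for i, b in enumerate(buckets) if b]


def calibrated_remaining(curve, reported_health):
    """Map a reported health score to an expected remaining life in days.

    Interpolates linearly between curve points and clamps to the first or
    last point when the score falls outside the observed range.
    """
    xs = [x for x, _ in curve]
    if reported_health <= xs[0]:
        return curve[0][1]
    if reported_health >= xs[-1]:
        return curve[-1][1]
    i = bisect_left(xs, reported_health)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (reported_health - x0) / (x1 - x0)


# Made-up sample data, purely to show the call shape.
sample = [(0.1, 10), (0.1, 20), (0.5, 30), (0.9, 1000)]
curve = build_calibration_curve(sample)
estimate = calibrated_remaining(curve, 0.3)
```

With the made-up sample, a drive reporting "50% health" maps to far less than half of a healthy drive's remaining life, which is the non-linearity described above. A real version would need many more observations per bin and some way to handle drives that have not failed yet.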
---
As an aside, I've always wondered why we don't use calibration curves more. They're great for estimating a lot of things:
• Remaining battery life: sort of works this way (in that battery output is fairly constant until the battery "runs out"; 10% remaining = battery at a slightly lower voltage, 0% remaining = battery still plenty charged but no longer charged enough to output the proper voltage for the device.) But could be calibrated way better by actually correlating reported battery life to time, as observed by the device during its service life. This is made harder because we don't usually let batteries drain dry, though we do let them get into that precarious 10% "suboptimal voltage" case quite often. A properly-calibrated battery report should discharge linearly and recharge on an S-curve, rather than the other way 'round.
• Progress bars. If you're an OS manufacturer and you want to distribute an OS update, deploy it to a bunch of test machines and track how long each phase of the update takes, average them a bit pessimistically (maybe take the third-sigma median.) Now you can make a progress bar that appears to fill linearly, and gives a real, calibrated estimate of time-to-completion. The bar can be fronting pretty oblivious software, as long as it's split into phases itself: each phase can be kicked off and then the bar can just ease between N% and (N+P)% over the (pessimistically) estimated phase time, quickly cubic-sliding up to (N+P)% if the phase completes early. (I do know one piece of software that actually did things this way: Mac OS, pre-System-7.)
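The progress-bar scheme can be sketched roughly like this in Python. The names are hypothetical, and "mean plus k standard deviations" is my stand-in for "average them a bit pessimistically", not a claim about what any OS vendor does. The sketch only computes where the bar should sit; the easing animation when a phase finishes early is left out.

```python
from statistics import mean, pstdev


def pessimistic_estimate(samples, k=1.0):
    """Estimate a phase duration from test-machine timings.

    Uses mean + k * population standard deviation as an assumed
    'slightly pessimistic' average.
    """
    return mean(samples) + k * pstdev(samples)


def bar_position(estimates, phase_index, elapsed_in_phase):
    """Return the bar fill fraction (0..1) for the current phase.

    estimates: pessimistic duration of each phase, in seconds.
    The bar advances linearly with time inside a phase and is held at the
    phase boundary if the phase overruns its estimate.
    """
    total = sum(estimates)
    done = sum(estimates[:phase_index])
    within = min(elapsed_in_phase, estimates[phase_index])
    return (done + within) / total


# Made-up timings (seconds) for three update phases on test machines.
timings = [[8, 10, 12], [25, 30, 35], [50, 60, 70]]
estimates = [pessimistic_estimate(t) for t in timings]
position = bar_position(estimates, phase_index=1, elapsed_in_phase=15.0)
```

When a phase reports completion early, the caller would jump (or ease) to `bar_position(estimates, next_phase, 0.0)`, which is the (N+P)% boundary from the description above.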