| It seems like there's room for a general rule of thumb (one that could be operationalized) in systems that use redundancy to increase fault tolerance. The naive approach would be linear: treat two half-failed drives in a mirror as one whole failed drive. That wouldn't work too well, though, mostly because "half-failed" is actually nearly already failed.

This seems like the kind of thing that could use a "calibration curve": you observe how reported health actually correlates with remaining MTBF, then map future reported-health estimates through that curve to get the drive's actual health. I'm guessing the calibration curve will itself end up being a bathtub curve, which means that, e.g., a drive with one or two SMART errors would need to be considered "already on its way out."

But sometimes such drives live a long time, and as a matter of cost it's probably too expensive to throw out every disk that will probabilistically fail soon. It might be possible, though, to move them to some sort of "non-front-line" service instead, maybe Dynamo-like n=17 highly-redundant storage. (I wonder if AWS actually "recycles" EBS volumes into Dynamo/S3 volumes this way.)

---

As an aside, I've always wondered why we don't use calibration curves more. They're great for estimating a lot of things:

• Remaining battery life: this sort of works that way already (battery output is fairly constant until the battery "runs out"; 10% remaining = battery at a slightly lower voltage, 0% remaining = battery still plenty charged, but no longer charged enough to output the proper voltage for the device). But it could be calibrated far better by actually correlating reported battery life with time, as observed by the device during its service life. This is made harder because we don't usually let batteries drain dry, though we do let them get into that precarious 10% "suboptimal voltage" case quite often.
A properly-calibrated battery report should discharge linearly and recharge on an S-curve, rather than the other way 'round.

• Progress bars: if you're an OS manufacturer and you want to distribute an OS update, deploy it to a bunch of test machines, track how long each phase of the update takes, and average the timings a bit pessimistically (maybe take a three-sigma upper bound rather than the plain average). Now you can make a progress bar that appears to fill linearly and gives a real, calibrated estimate of time-to-completion. The bar can be fronting pretty oblivious software, as long as that software is itself split into phases: each phase can be kicked off, and then the bar can just ease between N% and (N+P)% over the (pessimistically) estimated phase time, quickly cubic-sliding up to (N+P)% if the phase completes early. (I do know one piece of software that actually did things this way: Mac OS, pre-System-7.) |
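To make the progress-bar idea concrete, here's a rough sketch of how I'd imagine it working. Everything in it is my own assumption rather than how any real updater does it: the names `Phase`, `build_phases`, and `bar_position` are made up, "pessimistic" is read as mean plus three standard deviations of the test-machine timings, each phase's share of the bar is set proportional to its estimated time (so the bar fills linearly in wall-clock time), and the early-finish "cubic slide" is an ease-out cubic over a short, arbitrary `catch_up` window.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Phase:
    name: str
    est_seconds: float  # pessimistic duration estimate from test machines
    share: float        # fraction of the bar this phase owns ("P" in the text)


def pessimistic_estimate(samples: List[float], sigmas: float = 3.0) -> float:
    """Assumed reading of 'pessimistic': mean plus `sigmas` standard deviations."""
    return statistics.fmean(samples) + sigmas * statistics.pstdev(samples)


def build_phases(timings: Dict[str, List[float]]) -> List[Phase]:
    """Turn observed per-phase timings into phases whose bar share is
    proportional to estimated time, so the bar fills linearly overall."""
    estimates = {name: pessimistic_estimate(s) for name, s in timings.items()}
    total = sum(estimates.values())
    return [Phase(name, est, est / total) for name, est in estimates.items()]


def ease_out_cubic(t: float) -> float:
    """Fast start, gentle landing; input is clamped to [0, 1]."""
    t = min(max(t, 0.0), 1.0)
    return 1.0 - (1.0 - t) ** 3


def bar_position(
    phases: List[Phase],
    index: int,
    elapsed: float,
    finished_at: Optional[float] = None,
    catch_up: float = 0.3,
) -> float:
    """Bar fraction in [0, 1] for phase `index`.

    elapsed:     seconds since the phase was kicked off
    finished_at: elapsed time at which the phase reported completion,
                 or None while it is still running
    catch_up:    seconds spent sliding to the phase end after an early finish
    """
    start = sum(p.share for p in phases[:index])  # "N" in the text
    phase = phases[index]
    end = start + phase.share                     # "N + P" in the text

    if finished_at is None:
        # Still running: fill linearly, and hold at the phase end on overrun.
        frac = min(elapsed / phase.est_seconds, 1.0)
        return start + phase.share * frac

    # Finished: slide from wherever the bar was up to the phase end.
    at_finish = start + phase.share * min(finished_at / phase.est_seconds, 1.0)
    t = (elapsed - finished_at) / catch_up
    return at_finish + (end - at_finish) * ease_out_cubic(t)


# Hypothetical timings (seconds) gathered from test machines.
phases = build_phases({"download": [10.0, 10.0], "install": [30.0, 30.0]})
print(bar_position(phases, 1, elapsed=15.0))
```

The software behind the bar only needs to report "phase started" and "phase finished"; all of the smoothness comes from the calibration data. One weakness of this sketch is that a phase that overruns its estimate just leaves the bar parked at (N+P)%, which is why the estimate wants to be pessimistic in the first place.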