by eli 5124 days ago
Of course a RAID controller isn't supposed to fail silently, but it can and it does. I can't think of many complex pieces of technology that work 100% of the time.

You don't think that something that only exists to create disk redundancy is in a different category from complex pieces of technology that don't have this in their name?

I simply disagree that you should "never underestimate" your raid controller's ability to fail silently (which is the comment I was replying to). If this is even on your radar you don't have a RAID controller.

This is literally like saying, "Never underestimate your digest algorithm's ability to hash the same file to different values, making the checksum seem to fail." That's not a digest algorithm, that's a randomized print statement.

A RAID controller whose ability to fail silently you should "never underestimate" is, some of the time, literally a paper plate with "raid controller" written on it. Call it "sometimes raid" or "maybe raid" or "more raid". You don't have a raid controller.

In general, then, we shouldn't call RAM memory since it might misremember, we shouldn't call them computers since they might miscompute, we shouldn't call it encryption since it might not encrypt, and we shouldn't call them bicycles since one of the wheels might fall off. Is that about it? I think I can see your point, but as we're nowhere near the point at which machines become as reliable as humans, let alone utterly reliable, I'm not quite sure of the use of the fine distinction you're trying to draw.

No, I don't at all mean in general.

(See my cousin reply here).

That is not at all "about it". I mean, specifically, for the layer that a RAID produces. It's simple. When you add RAID, you add a layer on top of physical hard-drives to make them redundant.

This type of layer carries a completely different expectation from all of your other examples. The example in my cousin reply is apt: it would be like expecting a checksumming algorithm (which you're ONLY using to verify that a file is genuine) to sometimes fail and produce a random checksum from the space of possible checksums, instead of the checksum the algorithm actually produces for that particular file. Or, if it has a comparison function, to sometimes fail and say that the file matches the provided checksum regardless of whether it does.

This is ridiculous: such a layer wouldn't be a checksum, it would be completely different. The idea that I have to physically roll a layer on top of my checksum, to check whether it's currently acting like a randomized print statement or a comparison function whose truth value is randomly negated, is ridiculous.
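To make the checksum expectation concrete, here is a minimal sketch using SHA-256 from Python's standard hashlib. The helper names `sha256_hex` and `matches` and the sample payload are made up for illustration; the point is only that the same bytes always hash to the same digest and the comparison never reports a false match.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches(data: bytes, expected_hex: str) -> bool:
    """Comparison function: True only if data hashes to expected_hex."""
    return sha256_hex(data) == expected_hex


# Hypothetical stand-in for a file's contents.
payload = b"example file contents"
digest = sha256_hex(payload)

# Same input, same digest, on every call.
assert all(sha256_hex(payload) == digest for _ in range(1000))

# The unchanged payload verifies; a one-character change does not.
assert matches(payload, digest)
assert not matches(b"Example file contents", digest)
```

This only shows the algorithm's contract in software. It says nothing about the hardware running it, which is the part the rest of the thread is arguing about.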

I don't know how else to put this. Maybe instead of your RAM and bicycle examples, I can give you these:

-> Imagine you are adding a fuse to a circuit to protect it, but the fuse sometimes just saves up electricity so it can release it in one quick burst and overload the circuit. That's not a fuse.

-> Imagine if you hire an auditor to make sure your employees aren't misappropriating funds, since the business involves a lot of cash, but your auditor sometimes just pockets cash. That's not an auditor. You only thought you hired an auditor. The solution isn't to make sure the auditor has an auditor, it's to hire an actual auditor instead of someone you mistakenly think is one.

-> Imagine if you buy insurance, but the company will sometimes just spend money on lawyers to avoid paying out, even when the event clearly happened and you were clearly covered. That's not insurance - that's a scam. You shouldn't have to insure the layer of insurance with insurance against the insurance company out-lawyering you. You should get an actual insurance policy.

-> Imagine if you buy a seatbelt, but after buckling it, there is a realistic chance that you really haven't, and it's just a clothing item draped across your body and not attached in any way at any point.

Well if that's possible, that's just not a seatbelt. It's a defective item that was supposed to be a seatbelt but isn't.

The point is, all these examples are optional layers on TOP of a process. If they have a realistic chance of failing as in the above descriptions, they simply are not what they're claiming to be. Their chance of failure should be so low you don't even have to think about it; if it isn't, you should just hire or buy a different one, since you made a mistake.

I don't know what you're getting at, but I think you are underestimating the chance that your RAID controller will fail.

Actually, I don't have a RAID controller. But at least I don't think I have one!

If I did get one, I wouldn't get one that tended to silently fail, since that would pretty well defeat the purpose of thinking my disks were redundant, wouldn't it?

"Tend to fail silently" is different from "could fail silently at some point, though it is unlikely." You are correct that RAID controllers shouldn't be the former; however, there is absolutely no way to prevent RAID controllers from being the latter.

Could you tell me your reasoning about the last sentence?

If you were talking about anything else, like a normal hard drive, fine, of course it can fail silently. But the whole point of RAID is that it is another layer on top of hard drives, there to make them redundant and to chirp loudly when one dies or starts returning wrong data and has to be removed, so that you can replace it and rebuild the array.

I mean, all the RAID controller does is write data that is always redundant (even when it thinks all drives are working fine). How is it not possible for it to check that redundancy for consistency as well? Especially in RAID-6 and similar configurations, which carry even more redundancy?
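The kind of consistency check being asked about can be sketched with single XOR parity, as in a simplified RAID-5-style stripe. This is a toy model under stated assumptions, not how any particular controller works: the function names (`xor_parity`, `stripe_is_consistent`, `rebuild_block`) and the four-byte blocks are invented for illustration.

```python
from functools import reduce
from typing import List


def xor_parity(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks: List[bytes], parity: bytes) -> bool:
    """Scrub check: recompute parity from the data and compare with what is stored."""
    return xor_parity(data_blocks) == parity


def rebuild_block(surviving_blocks: List[bytes], parity: bytes) -> bytes:
    """Recover one missing data block from the survivors plus the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)

# A healthy stripe passes the scrub.
assert stripe_is_consistent(data, parity)

# A silently altered byte on one "drive" makes the scrub fail.
corrupted = [data[0], b"BBXB", data[2]]
assert not stripe_is_consistent(corrupted, parity)

# A drive known to be dead can be rebuilt from the others.
assert rebuild_block([data[0], data[2]], parity) == data[1]
```

Note that in this single-parity model a failed scrub only says the stripe is inconsistent; it does not by itself say which block is the wrong one.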

Of course, on a probabilistic level random bit rot means "nothing is certain", but on a practical level, how can you expect a RAID controller to fail silently, when all it does is corral redundant data around, create checksums, verify what's written, etc.? Preventing that is the whole reason it exists.

To me this is like saying that a checksumming algorithm should be expected to sometimes fail and just return a checksum chosen randomly from the space of all possible checksums, instead of the checksum actually produced by the algorithm for that data.

That's ridiculous. I shouldn't have to even think about putting another layer on top of the checksum, so that I can checksum it. The very idea of having to do that means you don't have a checksumming algorithm.

This thing should be right up there with bitrot causing bash to execute an rm -rf whenever you drop down to root. Sure that's possible, but that's not even in the scope of anything you have to think about.

To me, a RAID is a layer on top of hard-drives that makes them redundant. Any controller that has a realistic chance of failing silently simply does not fit that definition.

Yes, the purpose of RAID is not to fail silently; but it's hardware, and hardware can be flawed. That's just a fact. There are different levels of RAID precisely because there are different levels of redundancy - that is, different extents to which the possibility of failure is minimized. No hardware is flawless, though.

Please note that I have not said "it is likely to fail" or "you should expect that it will probably fail." I agree that it shouldn't be something that keeps a person up at night. But the simple fact is that, when data is important, you should prepare for that possibility (and others) by backing up. RAID does not solve all problems, and it is not guaranteed, as unlikely as failure might be.

Moreover - in saying that it simply isn't RAID if it ever fails silently, you're attempting to define away a nonsemantic problem. The point of a starter motor on a car is to start the engine. If the starter motor fails to start the engine, I guess I could make an Aristotelian argument that it has ceased to be a starter motor, or even perhaps that it was never a starter motor in the first place. But what practical good does that do anybody?

All hardware has the potential to fail. Yes, people should buy hardware that is less likely to fail. I'm pretty sure they already do that, though.