by its_so_on 5125 days ago
Actually, I don't have a RAID controller. Or at least I don't think I have one!

If I did get one, I wouldn't get one that tended to silently fail, since that would pretty well defeat the purpose of thinking my disks were redundant, wouldn't it?

1 comments

"Tend to fail silently" is different from "could fail silently at some point, though it is unlikely." You are correct that RAID controllers shouldn't be the former; however, there is absolutely no way to prevent RAID controllers from being the latter.
Could you tell me your reasoning about the last sentence?

If you were talking about anything else, like a normal hard drive, fine, of course it can fail silently. But the whole point of a RAID is to be another layer on top of hard drives, to make them redundant and chirp loudly when one dies or starts having wrong data and has to be removed, so that you can replace it and rebuild the RAID.

I mean, all the RAID controller does is write data that is always redundant (even when it thinks all drives are working fine). How is it not possible for it to check for this consistency as well? Especially in RAID 6 and similar configurations, which carry even more redundancy?
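To make the "check for this consistency" idea concrete, here is a minimal sketch of a parity scrub, assuming single XOR parity in the style of RAID 5. The function names (`xor_parity`, `scrub`, `make_stripe`) and the in-memory "stripes" are made up for illustration; this is not how any particular controller is implemented.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Hypothetical helper: pair the data blocks with their freshly computed parity."""
    blocks = list(data_blocks)
    return (blocks, xor_parity(blocks))


def scrub(stripes):
    """Return the indices of stripes whose stored parity no longer matches the data."""
    return [
        index
        for index, (data_blocks, parity) in enumerate(stripes)
        if xor_parity(data_blocks) != parity
    ]


stripes = [
    make_stripe([b"AAAA", b"BBBB", b"CCCC"]),
    make_stripe([b"DDDD", b"EEEE", b"FFFF"]),
]

# Simulate silent corruption: one data block changes, the stored parity does not.
stripes[1][0][1] = b"EEEX"

print(scrub(stripes))  # [1]
```

Note what this does and does not give you: the scrub can tell that stripe 1 is inconsistent, but with single parity alone it cannot tell which block in the stripe is the wrong one. And it only catches anything if the scrub actually runs and its result is reported.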

Of course, on a probabilistic level random bit rot means "nothing is certain", but on a practical level, how can you expect a RAID controller to fail silently, when all it does is corral redundant data around, create checksums, verify what's written, etc.? It's the whole reason it exists.

To me this is like saying that a checksumming algorithm should be expected to sometimes fail and just return a checksum chosen randomly from the space of all possible checksums, instead of the checksum actually produced by the algorithm for that data.

That's ridiculous. I shouldn't have to even think about putting another layer on top of the checksum, so that I can checksum it. The very idea of having to do that means you don't have a checksumming algorithm.

This thing should be right up there with bitrot causing bash to execute an rm -rf whenever you drop down to root. Sure that's possible, but that's not even in the scope of anything you have to think about.

To me, a RAID is a layer on top of hard-drives that makes them redundant. Any controller that has a realistic chance of failing silently simply does not fit that definition.

Yes, the purpose of RAID is not to fail silently; but it's hardware, and hardware can be flawed. That's just a fact. There are different levels of RAID precisely because there are different levels of redundancy - that is, different extents to which the possibility of failure is minimized. No hardware is flawless, though.

Please note that I have not said "it is likely to fail" or "you should expect that it will probably fail." I agree that it shouldn't be something that keeps a person up at night. But the simple fact is that, when data is important, you should prepare for that possibility (and others) by backing up. RAID does not solve all problems, and it is not guaranteed, as unlikely as failure might be.

Moreover - in saying that it simply isn't RAID if it ever fails silently, you're attempting to define away a nonsemantic problem. The point of a starter motor on a car is to start the engine. If the starter motor fails to start the engine, I guess I could make an Aristotelian argument that it has ceased to be a starter motor, or even perhaps that it was never a starter motor in the first place. But what practical good does that do anybody?

All hardware has the potential to fail. Yes, people should buy hardware that is less likely to fail. I'm pretty sure they already do that, though.

Hi,

You might read this first:

http://news.ycombinator.com/item?id=4057912

you can reply to that as well here if you want.

I think we're in very general agreement. Although you yourself did not say "it is likely to fail" or "you should expect that it will fail", this is exactly the sentiment I was replying to.

Regarding your "all hardware possibly failing" and the example of a starter motor to imply that I am trying to disappear a technical problem with a semantic argument, I think I am (especially in that cousin reply) being quite a bit more specific.

Basically, when it comes to safety mechanisms that exist as a layer on top of a process and aren't necessary at all, I simply shouldn't have to even think about reinventing another safety mechanism on top of the safety mechanism. Get one that isn't defective.

A hard-drive isn't defective just because it fails: it's expected to. A RAID controller is also expected to fail...JUST NOT SILENTLY.

In the seatbelt example: should you even think about having to tie your seatbelt to the buckle with sturdy rope, for real safety in case the seatbelt just doesn't buckle when it seems to, or comes undone like a ripped shirt button at the slightest firm tug?

No. You should get an actual seatbelt.

Basically, the standard you hold a control layer to is different from the standard you hold an underlying process to.

It would be like the difference between your brake failing and your (for added safety) handbrake failing, which you only engage on top of the motor's brake anyway. If the motor brake fails you would start rolling (if you're on a bit of an incline). But you shouldn't even have to think about a handbrake 'just failing' in the same condition.

Sure it can fail if you are being towed without being lifted, or whatever, in an extreme situation. But in a normal situation?

Basically, it is a difference of both category/kind AND of degree.

I am certainly not saying that a parking brake can never fail. I am not saying a RAID controller can never fail.

I am saying that both of these, when they are layers on top of a normal process, should be out of sight, below your threshold of having to control for it. If they're not, you need to get a different one.

You don't get six insurance policies against the same earthquake possibility, hoping that they won't ALL decide to out-lawyer you or go bankrupt. You get real insurance that's properly reinsured. Check up on them. Find a real one.

RAID failure is fine. Silent RAID failure is not fine.

(Checksum failure with an exception is fine; checksum failure with no exception, warning, or error, just a random checksum produced, or a check randomly passing when the checksum doesn't match the one you provided, is not okay. Fix your checksum, get a real one. Don't build another layer on top for the cases where your checksum is a randomized print statement, or your insurance policy is a monthly donation from you to a non-charitable organization that puts aside a portion to out-lawyer you with if you try to make a claim, with the rest spent on advertising or kept as profit. That's not an insurance policy, that's a scam.)
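For what "failure with an exception" might look like, here is a small sketch using SHA-256 from Python's standard `hashlib`. The `ChecksumMismatch` exception and the `verify` helper are names invented for this example; the point is only that a mismatch gets raised rather than swallowed.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Hypothetical error type: raised when data does not match its expected digest."""


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_hex: str) -> None:
    """Return quietly on a match; fail loudly on a mismatch."""
    actual_hex = sha256_hex(data)
    if actual_hex != expected_hex:
        raise ChecksumMismatch(f"expected {expected_hex}, got {actual_hex}")


payload = b"important data"
digest = sha256_hex(payload)

verify(payload, digest)  # matches, so nothing happens

detected = False
try:
    verify(b"important dat4", digest)  # corrupted copy of the payload
except ChecksumMismatch:
    detected = True  # the failure is visible to the caller

print(detected)  # True
```

The "not okay" case from the paragraph above would be a `verify` that returns normally on the corrupted payload.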

Yeah, I think we're in general agreement. With RAID that works the way RAID is supposed and expected to, the chance of silent failure is very, very small.