


	by ender7 5172 days ago
	Never underestimate your RAID controller's ability to fail (silently!) and start writing corrupted garbage to your disks.

3 comments

wayne_h 5171 days ago

I once did a RAID data recovery on a system with a high-end Intel RAID controller. The controller failed and they sent a new one, only the new one couldn't assemble the array properly. It turns out that there was a flaw in the logic for where parity was stored. Normally parity is spread evenly across all the drives; not on this version. I had to reverse-engineer the crazy RAID pattern and write a program to de-RAID it. The flaw had gone undetected for as long as the original controller was running.
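
For readers unfamiliar with what "where parity is stored" means, here is a minimal sketch of single-parity striping, assuming a simple rotation where the parity block moves back one disk per stripe. This is not the Intel controller's layout (the comment does not describe it); real controllers use several different rotations, and the function names `parity_disk`, `build_stripe` and `rebuild_missing` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk(stripe, num_disks):
    """Assumed rotation: parity starts on the last disk and moves back one disk per stripe."""
    return (num_disks - 1 - stripe) % num_disks


def build_stripe(data_blocks, stripe, num_disks):
    """Lay out num_disks - 1 data blocks plus one parity block for a stripe."""
    p = parity_disk(stripe, num_disks)
    layout = [None] * num_disks
    layout[p] = xor_blocks(data_blocks)
    remaining = iter(data_blocks)
    for disk in range(num_disks):
        if disk != p:
            layout[disk] = next(remaining)
    return layout


def rebuild_missing(layout, missing):
    """Recover the block of one failed disk by XORing all surviving blocks."""
    return xor_blocks([blk for i, blk in enumerate(layout) if i != missing])


layout = build_stripe([b"AAAA", b"BBBB", b"CCCC"], stripe=0, num_disks=4)
print(rebuild_missing(layout, missing=1))  # recovers b"BBBB"
```

A replacement controller that assumes a different `parity_disk` rule would read parity as data and data as parity, which is one way an array can look fine on the original hardware and fail to assemble on new hardware.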

wayne_h 5171 days ago

And don't underestimate the manufacturer's ability to screw it up. Like when people 'upgrade' the firmware in their Buffalo NAS box and after that they can't see the data anymore. Luckily the data was still there and undamaged, but it took data recovery to get it back.

its_so_on 5172 days ago

EDIT: people didn't like my humor. Well, look, the whole thing that you're buying with a raid controller is...redundancy. So if it's not redundant, failing silently, while telling you it's being redundant, how is this different from, say, paying for a house inspection that doesn't get done? If a raid controller is allowed to silently fail, it becomes a post-experience good.

http://en.wikipedia.org/wiki/Experience_good

Meaning that even while you're using it, you have no idea if it works.

My contention is that it's not a raid array if it can silently stop being redundant without telling you.

At best it's a Possibly Redundant Array of Inexpensive Disks.

(The below is how my comment first read.)

(sarcastic) Yeah, it's only prudent to grab a drive out from time to time and make a surprise inspection of whether it's actually filled up a full 4/5th of the way (or whatever) with the actual data the volume is supposed to contain! And the remaining fifth had better look a damn sight like parity information!

Seriously though, a controller that fails like this isn't a RAID controller, since what separates it from a paper plate and a cardboard box? On the paper plate you write "RAID controller" and tape it to an already attached hard drive, and you put the remaining members of the redundant array into the cardboard box. No setup or even connection required!

Seriously seriously though, what you're suggesting is unacceptable. That's not a RAID controller, that's a scam.
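
The "surprise inspection" joked about above is roughly what a parity scrub does. A minimal sketch, assuming single XOR parity and that each stripe's data blocks and stored parity block have already been read into memory (the `scrub` helper and its input shape are hypothetical, not any controller's API):

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub(stripes):
    """Return the indices of stripes whose stored parity does not match their data.

    stripes is a list of (data_blocks, stored_parity) pairs.
    """
    return [
        index
        for index, (data_blocks, stored_parity) in enumerate(stripes)
        if xor_blocks(data_blocks) != stored_parity
    ]


good = ([b"\x01\x02", b"\x04\x08"], b"\x05\x0a")
bad = ([b"\x01\x02", b"\x04\x08"], b"\x00\x00")
print(scrub([good, bad, good]))  # [1]
```

Note the limits of this kind of check: a mismatch says that something in the stripe is wrong but not which block, and if garbage was written and parity was then computed over that garbage, the stripe still passes.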

eli 5172 days ago

Of course a RAID controller isn't supposed to fail silently, but it can and it does. I can't think of many complex pieces of technology that work 100% all of the time.

its_so_on 5172 days ago

You don't think that something that only exists to create disk redundancy is in a different category from complex pieces of technology that don't have this in their name?

I simply disagree that you should "never underestimate" your raid controller's ability to fail silently (which is the comment I was replying to). If this is even on your radar you don't have a RAID controller.

This is literally like saying, "Never underestimate your digest algorithm's ability to hash the same file to different values, making the checksum seem to fail." That's not a digest algorithm, that's a randomized print statement.

A RAID controller whose ability to fail silently you should 'never underestimate' is literally sometimes the same as a paper plate with "raid controller" written on it. Call it "sometimes raid" or "maybe raid" or "more raid". You don't have a RAID controller.

hythloday 5172 days ago

In general, then, we shouldn't call RAM memory since it might misremember, we shouldn't call them computers since they might miscompute, we shouldn't call it encryption since it might not encrypt, and we shouldn't call them bicycles since one of the wheels might fall off. Is that about it? I think I can see your point, but as we're nowhere near the point at which machines become as reliable as humans, let alone utterly reliable, I'm not quite sure of the use of the fine distinction you're trying to draw.

its_so_on 5171 days ago

No, I don't at all mean in general.

(See my cousin reply here).

That is not at all "about it". I mean, specifically, for the layer that a RAID produces. It's simple. When you add RAID, you add a layer on top of physical hard-drives to make them redundant.

This type of layer has a completely different expectation from all of your other examples. The example in my cousin reply is apt: it would be like expecting a checksumming algorithm (which you're ONLY using to add verification that a file is genuine) to sometimes fail and produce a random checksum in the space of possible checksums the algorithm can produce, instead of the checksum that the algorithm actually produces for that particular file. Or if it has a comparison function, to sometimes fail and say that the file checksums to the provided checksum, regardless of whether it does so.

This is ridiculous: such a layer wouldn't be a checksum, it would be completely different. The idea that I have to physically roll a layer on top of my checksum, to check whether it's currently acting like a randomized print statement or a comparison function whose truth value is randomly negated, is ridiculous.
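
To make the expectation concrete, here is a minimal sketch of the two operations described above, a digest and a comparison function, using SHA-256 from Python's standard `hashlib`. The helper names `digest` and `verify` are hypothetical; the point is only that the same bytes always produce the same checksum, and the comparison depends on nothing but its inputs.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Return the SHA-256 checksum of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_hex: str) -> bool:
    """Return True only if data hashes to the expected checksum."""
    return hmac.compare_digest(digest(data), expected_hex)


checksum = digest(b"abc")
print(checksum == digest(b"abc"))  # True: same input, same checksum
print(verify(b"abc", checksum))    # True
print(verify(b"abd", checksum))    # False: different input
```

This shows the behavior the algorithm specifies. Whether the hardware running it can still fail is the question the rest of the thread argues about.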

I don't know how else to put this. Maybe instead of your RAM and bicycle examples, I can give you these examples:

-> Imagine if you are adding a fuse to a circuit to protect it, but the fuse sometimes actually just saves up electricity so it can release it in one quick burst and override the circuit. That's not a fuse.

-> Imagine if you hire an auditor to make sure your employees aren't misappropriating funds, since the business involves a lot of cash, but your auditor sometimes just pockets cash. That's not an auditor. You only thought you hired an auditor. The solution isn't to make sure the auditor has an auditor, it's to hire an actual auditor instead of someone you mistakenly think is one.

-> Imagine if you buy insurance, but actually the company sometimes will just spend money on lawyers to avoid having to pay out, even when the event clearly happened and you were clearly covered. That's not insurance - that's a scam. You shouldn't have to insure the layer of insurance with an insurance against the insurance company out-lawyering you. You should get an actual insurance policy.

-> Imagine if you buy a seatbelt, but after buckling it, there is a realistic chance that you really haven't, and it's just a clothing item draped across your body and not attached in any way at any point.

Well if that's possible, that's just not a seatbelt. It's a defective item that was supposed to be a seatbelt but isn't.

The point is, all these examples are optional layers on TOP of a process. If they have a realistic chance of failing as in the above descriptions, they simply are not what they're claiming to be. Their chance of failure should be so low you can't even think about it; if it isn't, you should just hire or buy a different one, since you made a mistake.

eli 5172 days ago

I don't know what you're getting at, but I think you are underestimating the chance that your RAID controller will fail.

its_so_on 5172 days ago

Actually, I don't have a RAID controller. But at least I don't think I have one!

If I did get one, I wouldn't get one that tended to silently fail, since that would pretty well defeat the purpose of thinking my disks were redundant, wouldn't it?

koeselitz 5172 days ago

"Tend to fail silently" is different from "could fail silently at some point, though it is unlikely." You are correct that RAID controllers shouldn't be the former; however, there is absolutely no way to prevent RAID controllers from being the latter.
