	by stephengillie 2832 days ago
	This is one reason why RAID 0+1 is a best practice, and RAID 5 & 6 are no longer recommended. It takes too long to rebuild the array, leading to a multi-failed disk situation.

5 comments

nine_k 2832 days ago

RAID6 should be fine rebuilding online (in RAID5 mode) even under a moderate write load.

Of course one should source RAID disks from 3 different vendors, to ensure that they are from different batches and are not going to fail at approximately the same time.

stephengillie 2832 days ago

Do other manufacturers produce this size of drive? It's difficult to source from 3 vendors if there's only one making the product.

thfuran 2832 days ago

Get one from amazon, one from newegg, one from the manufacturer directly or some such.

lostapathy 2832 days ago

I try to buy the hot spare or the last drive in a raid6 later than the rest of the array, to spread them out too.

tracker1 2832 days ago

Good advice, though I once had about half a dozen drives (12 drive RAID Z2 with 2 as hot spares) fail within a few weeks of each other despite sourcing them from separate batches. (Seagate 3TB drives; I think there have been articles on how bad that series was.)

tfigment 2832 days ago

I don't know how I survived those Seagates. They lasted maybe a year and then started dropping like flies. Synology seems to recommend identical drives, as I recall, but works fine with different sizes and makers afaict.

jmpman 2832 days ago

CRUSH algorithms are used to overcome rebuild limits in modern arrays. https://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf

kdkeyser 2832 days ago

CRUSH is an example (and not the first) of a "distributed rebuild" approach: you have an array of N drives (with N large, e.g. 100), and if 1 drive fails, you read in parallel from all (N-1) remaining drives, while distributing the reconstructed data across the remaining available capacity of all (N-1) remaining drives.

In effect, you get the total bandwidth of (N-1) HDDs working in parallel. And the bandwidth of 100 HDDs doing sequential IO in parallel is really massive (~10 GB/s).
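For a rough feel of the numbers, here is a back-of-the-envelope sketch. It assumes about 100 MB/s of sequential throughput per HDD (picked so that 100 drives land near the ~10 GB/s above) and a hypothetical 10 TB failed drive; both constants and the function names are illustrative assumptions, not measurements. Treat the result as an upper bound on the speedup, since it ignores that the surviving drives also absorb the rebuilt data and keep serving normal I/O.

```python
def rebuild_hours(data_tb: float, bandwidth_mb_s: float) -> float:
    """Hours needed to move `data_tb` terabytes at `bandwidth_mb_s` MB/s."""
    seconds = (data_tb * 1_000_000) / bandwidth_mb_s  # 1 TB = 1,000,000 MB
    return seconds / 3600


def aggregate_bandwidth_mb_s(n_drives: int, per_drive_mb_s: float) -> float:
    """Combined sequential bandwidth of the (N-1) surviving drives."""
    return (n_drives - 1) * per_drive_mb_s


PER_DRIVE_MB_S = 100.0   # assumption, not a measured figure
FAILED_DRIVE_TB = 10.0   # assumption: capacity of the failed drive

# Classic rebuild: bottlenecked on writing a single replacement drive.
classic = rebuild_hours(FAILED_DRIVE_TB, PER_DRIVE_MB_S)

# Distributed rebuild: work spread over the 99 survivors of a 100-drive array.
distributed = rebuild_hours(
    FAILED_DRIVE_TB, aggregate_bandwidth_mb_s(100, PER_DRIVE_MB_S)
)

print(f"classic: {classic:.1f} h, distributed: {distributed:.2f} h")
```

Under these assumptions the classic rebuild takes over a day while the distributed one finishes in well under an hour, which is in the same ballpark as the vendor claims below.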

Examples of companies claiming to use this approach are Qumulo (rebuild in a couple of hours), Infinidat (a few tens of minutes), ClusterStor GridRAID (now part of Seagate I think), or "Declustered RAID" in GPFS (IBM).

pinewurst 2832 days ago

GridRAID is owned by Cray now, who were the primary OEM from Seagate.

Thanks for pointing out that declustered/distributed rebuild RAID has many historical precedents (also 3PAR BTW) pre-CRUSH/Ceph.

chrisper 2832 days ago

Raid 01 has its own risk. If the wrong two disks fail, your entire array is toast.
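To make "the wrong two disks" concrete, here is a small sketch that enumerates every two-disk failure in a 4-disk array. It assumes the simple textbook model: RAID 0+1 (a mirror of two stripe sets) is lost once both stripe sets have a failed member, and RAID 1+0 (a stripe of two mirrors) is lost only when both halves of one mirror fail. The disk numbering and function names are made up for illustration, and smarter implementations may survive more cases than this model allows.

```python
from itertools import combinations

DISKS = range(4)
GROUPS = [{0, 1}, {2, 3}]  # two groups of two disks each


def raid01_survives(failed):
    """RAID 0+1: mirror of stripes; needs at least one stripe set fully intact."""
    return any(group.isdisjoint(failed) for group in GROUPS)


def raid10_survives(failed):
    """RAID 1+0: stripe of mirrors; needs at least one live disk in every mirror."""
    return all(group - set(failed) for group in GROUPS)


pairs = list(combinations(DISKS, 2))
fatal01 = [p for p in pairs if not raid01_survives(p)]
fatal10 = [p for p in pairs if not raid10_survives(p)]

print(f"RAID 01: {len(fatal01)} of {len(pairs)} two-disk failures are fatal")
print(f"RAID 10: {len(fatal10)} of {len(pairs)} two-disk failures are fatal")
```

In this model, 4 of the 6 possible pairs kill the RAID 01 array versus 2 of 6 for RAID 10.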

throwaway2048 2832 days ago

As opposed to raid 5, where if any two disks fail your array is toast; raid 6 increases this to 3.

However both raid 5 and 6 have 2 huge problems:

Data inflight at write time (power/hardware failures are more likely to corrupt the array, especially silently, which is the worst outcome).

Parity calculations require you to spin up the whole raid5/6 array during a rebuild, massively increasing the chance of a multi drive failure and a lost array. If one close-to-EOL drive dies, putting its sister drives through what is essentially an all day full tilt stress test is a terrible, terrible idea, and this idea keeps getting worse (takes longer) as drive sizes grow.

raid 0+1 mostly sidesteps these issues at a modest increase in drive count; it's a no-brainer for most setups.

StillBored 2832 days ago

>Data inflight at write time (power/hardware failures are more likely to corrupt the array, especially silently, which is the worst outcome).

How is that? RAID doesn't affect data persistence behavior in any meaningful way. FUA/SyncCache/etc. are supported by RAID controllers the same as by the underlying disks in writeback environments, parity updates included. Put another way, if you FUA or flush the writeback cache, those operations won't complete in a properly implemented RAID environment until the data is persisted somewhere, even if that means passing FUA down to the underlying storage. Granted, there are a number of ways to mess this up, such as RMW cycles in a controller that lacks some kind of persistent memory and a flush on power restore. Anyway, none of this is any worse than what happens in any other WB cached storage technology.

Finally, all this fearmongering about loss on rebuild should be explored more fully in light of the fact that decent RAID systems run background scrub operations on a regular basis. Those operations by themselves are going to "stress test" the array regularly while it's consistent and not degraded. I've actually got a fair amount of experience in this area, and I'm here to tell you that if you think this is a risk, consider what happens to non-RAIDed, unscrubbed drives that have a lot of data silently bitrotting on the platters. That latter effect is nearly always the problem in RAID environments when someone starts a rebuild on drives/sectors that have gone unread for extended periods of time. But in the case of RAID, a properly implemented system won't fail a drive for a single read failure during a rebuild; instead it reconstructs from the other drives, leaves the drive online long enough to complete the rebuild, and then takes it offline.

Basically raid 1 setups don't actually fix any of these problems, except through massive additional disk overhead. That overhead can also be applied to other RAID algorithms to much better effect. AKA a mirrored RAID 6 provides far more protection than a mirrored raid 0. Similar levels can be had with 6+6 in environments where that is possible, with trivial capacity overhead.

throwaway2048 2832 days ago

Raid 5/6 require parity calculations before data can be written to disk. This is a significant amount of data, especially at high writing speeds. That is what causes the inflight data problem.

Battery and flash backup on controllers doesn't fix the problem of hardware failure (which is significant, especially on big hot controllers).

StillBored 2832 days ago

Again, decent controllers have ECC protection and the like, and frequently are available in HA configurations if your worry is controller failure (along with redundant/dual data paths to the media via SAS/NVMe/etc). Plus, there is a long list of technologies that can be enabled at the HBA layer and pushed all the way to the media (T10 DIF/DIX comes to mind).

But much of this micro level redundancy is overkill, as frequently one uses some kind of application level HA/redundancy as well. So, loss of a RAID5/6 disk in a single machine is the functional equivalent of loss of any combination of RAID 0/1 in the same machine. You still need the higher level redundancy as well as a backup plan.

We could start breaking the discussion up into fabric attached vs direct attach RAID vs software, but I think it's sufficient to say that RAID5/6 doesn't _increase_ the failure surface in any meaningful way when you're not using fly-by-night RAID.

Edit: Maybe what you're trying to say is that cache flush/FUA operations for a given piece of data don't cover the parity calculation and buffers? That is false; a controller should not be responding to FUA/etc. until the entire block (including the parity) has been persisted. So if the controller dies during the operation, the host OS is fully aware that the operation didn't complete. The given block is of course left in some unknown state in this case, but that is true of any write operation that fails like this, regardless of WT/WB/RAID/etc.

cyphar 2832 days ago

The biggest problem with raid5 is that it is completely unprotected against silent corruption -- because there is no way for raid to know which data is the corrupted one (and as a result it has to decide whether the parity is correct or not -- though most raid implementations just ignore silent corruption completely, and so the parity is always assumed to be wrong in such cases).

So even if you rebuild an array, a bad drive might've blown away all of your data already. If you were to compare this with ZFS' "raid" Z1 (same parity, different design) you get detection and protection against silent data corruption.
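A toy illustration of the difference, using single-parity XOR over three tiny data blocks. This is only a sketch: the block contents are invented, and `zlib.crc32` is a stand-in for a separately stored per-block checksum, not what ZFS actually uses.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (single-parity, RAID 5 style)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Known failure: disk 1 is gone, so parity rebuilds it exactly.
rebuilt = xor_blocks([data[0], data[2], parity])

# Silent corruption: disk 1 returns wrong bytes without reporting an error.
on_disk = [data[0], b"BBXB", data[2]]
mismatch = xor_blocks(on_disk) != parity  # a scrub sees that something is off

# Parity alone cannot say which member is bad: blaming any one of them
# (data or parity) and rebuilding it from the rest gives a consistent stripe.
members = on_disk + [parity]
consistent_repairs = 0
for suspect in range(len(members)):
    others = members[:suspect] + members[suspect + 1:]
    repaired = members[:suspect] + [xor_blocks(others)] + members[suspect + 1:]
    if xor_blocks(repaired[:-1]) == repaired[-1]:
        consistent_repairs += 1

# Checksums kept apart from the data pinpoint the bad block directly.
checksums = [zlib.crc32(block) for block in data]
bad = [i for i, block in enumerate(on_disk) if zlib.crc32(block) != checksums[i]]

print(rebuilt == data[1], mismatch, consistent_repairs, bad)
```

All four possible "repairs" look equally valid to plain parity, while the checksum comparison singles out block 1, after which the same XOR rebuild recovers the right data.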

imtringued 2831 days ago

>through what is essentially an all day full tilt stress test is a terrible, terrible idea

The rebuild isn't putting the disks under stress. The sister drive has already failed silently but you only notice this once you start the rebuild. The solution is to check the disks once a week by fully reading every sector.
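A bare-bones sketch of what "fully reading every sector" could look like from userspace. The `scrub` function and its parameters are hypothetical; it relies on `os.pread`, so it is Unix-only, and a real scrub would also need to bypass the page cache and, on a RAID, compare data against parity. In practice you'd lean on the scrub facility built into your RAID layer or filesystem rather than something like this.

```python
import os


def scrub(path: str, chunk_size: int = 1 << 20) -> list[int]:
    """Read `path` end to end; return offsets of chunks that raised read errors."""
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for files and block devices
        offset = 0
        while offset < size:
            try:
                chunk = os.pread(fd, min(chunk_size, size - offset), offset)
            except OSError:
                # Unreadable region: note it and skip ahead to the next chunk.
                bad_offsets.append(offset)
                offset += chunk_size
                continue
            if not chunk:
                break  # unexpected end of data
            offset += len(chunk)
    finally:
        os.close(fd)
    return bad_offsets
```

An empty list means every chunk was readable; any offsets returned are where you'd want to look before a rebuild forces the issue.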

yellowapple 2832 days ago

The normal answer here is to make sure that each side of the RAID10 (RAID01 is something different and much less common) mirror uses drives from a different vendor, thus giving each side a different bathtub curve / failure rate and mitigating the impact of a bad batch. This is a nice advantage over parity-based setups like RAID6 (since replicating this with RAID6 would require finding a unique vendor for each array member, and there are only so many vendors).

For archival purposes, though, you're probably better off with a normal RAID1 + some kind of JBOD setup (like with LVM); striping makes data recovery more difficult should you indeed lose all RAID1 sides of a given member.

nine_k 2832 days ago

You can upgrade a 2-disk RAID1 to a 3-disk RAID5, then chain them to RAID0 as normal. It gives you a better chance to keep data intact, hopefully without lowering the write speed seriously.

https://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_50_(RA...

zaarn 2831 days ago

RAID 50 doesn't really solve it, it exposes you to some more risk since you can still die with 2 disks but now you have more disks in each sub array.
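A quick count of how exposed each layout is to a second failure, for a hypothetical 6-disk array. It assumes failures are independent and equally likely on every disk (which bad batches violate), and that each sub-array tolerates exactly one failure; the function name is made up for illustration.

```python
from math import comb


def fatal_pair_fraction(groups: int, group_size: int) -> float:
    """Fraction of two-disk failures that hit the same single-fault-tolerant group."""
    total_pairs = comb(groups * group_size, 2)
    same_group_pairs = groups * comb(group_size, 2)
    return same_group_pairs / total_pairs


raid10 = fatal_pair_fraction(groups=3, group_size=2)  # three 2-disk mirrors
raid50 = fatal_pair_fraction(groups=2, group_size=3)  # two 3-disk RAID 5 sets
raid6 = 0.0  # a single RAID 6 set tolerates any two failures by definition

print(f"RAID 10: {raid10:.0%}  RAID 50: {raid50:.0%}  RAID 6: {raid6:.0%}")
```

With six disks, 3 of the 15 possible pairs are fatal for RAID 10 and 6 of 15 for RAID 50, so the bigger sub-arrays double the odds that a second failure lands in the wrong place.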

The correct answer is either a 3-mirror RAID1 or RAID6.

Bcachefs also promises some solution to this by allowing both erasure encoding and replication to co-exist, according to its documentation.

stephengillie 2832 days ago

Multiple "0" drives can be added for further redundancy.

astrodust 2832 days ago

"Zero" drives are the ones that when you lose them you have zero data.

"One" drives are the ones with a copy.

stephengillie 2832 days ago

I haven't worked with arrays for years. Sorry for the mistakes.

ahoka 2832 days ago

Interesting. What do you think is the advantage of raid01 over raid10? The latter looks safer at first sight.

stephengillie 2832 days ago

I get RAID 01 and 10 mixed up all the time. These names are too similar. Please understand that I meant the better of the 2.

Hei1Fuya 2832 days ago

ZoL 0.8 will have sequential resilver which should be able to restore a disk in a few hours.
