This is one reason why RAID 0+1 is a best practice, and RAID 5 & 6 are no longer recommended. It takes too long to rebuild the array, leading to a multi-failed disk situation.
RAID6 should be fine rebuilding online (in RAID5 mode) even under a moderate write load.
Of course one should source RAID disks from 3 different vendors, to ensure that they are from different batches and are not going to fail at approximately the same time.
Good advice, though I once had about half a dozen drives (12 drive RAID Z2 with 2 as hot spares) fail within a few weeks of each other in separate batches from sourcing. (Seagate 3TB drives, I think there's been articles on how bad that series was).
I don't know how I survived those Seagates. They lasted maybe a year and then started dropping like flies. Synology seems to recommend identical drives, as I recall, but works fine with different sizes and makers afaict.
CRUSH is an example (and not the first) of a "distributed rebuild" approach: you have an array of N drives (with N large, e.g. 100), and if 1 drive fails, you read in parallel from all (N-1) remaining drives, while distributing the reconstructed data across the remaining available capacity of all (N-1) remaining drives.
In effect, you get the total bandwidth of (N-1) HDDs working in parallel. And the bandwidth of 100 HDDs doing sequential IO in parallel is really massive (~10 GB/s).
Examples of companies claiming to use this approach are Qumulo (rebuild in a couple of hours), Infinidat (a few tens of minutes), ClusterStor GridRAID (now part of Seagate, I think), or "Declustered RAID" in GPFS (IBM).
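To put rough numbers on the distributed rebuild idea, here is a back-of-the-envelope model in Python. Everything in it is an assumption for illustration: 100 MB/s sequential per drive (which is what makes 100 drives come out at ~10 GB/s), a 10 TB failed drive, and 8 surviving blocks read per reconstructed block. The function names are made up, and the model ignores foreground load and seek overhead.

```python
# Back-of-the-envelope rebuild-time model. All figures are assumptions.

def traditional_rebuild_hours(capacity_tb: float, drive_mb_s: float) -> float:
    """Classic rebuild: one replacement drive absorbs every reconstructed byte,
    so its write bandwidth is the bottleneck."""
    capacity_mb = capacity_tb * 1_000_000
    return capacity_mb / drive_mb_s / 3600


def distributed_rebuild_hours(capacity_tb: float, drive_mb_s: float,
                              n_drives: int, reads_per_byte: int) -> float:
    """Distributed rebuild: reads and writes are spread over the N-1 survivors.
    Each reconstructed byte costs `reads_per_byte` reads plus one write."""
    capacity_mb = capacity_tb * 1_000_000
    total_io_mb = capacity_mb * (reads_per_byte + 1)
    aggregate_mb_s = (n_drives - 1) * drive_mb_s
    return total_io_mb / aggregate_mb_s / 3600


if __name__ == "__main__":
    print(f"aggregate bandwidth: {(100 - 1) * 100 / 1000:.1f} GB/s")
    print(f"traditional: {traditional_rebuild_hours(10, 100):.1f} h")
    print(f"distributed: {distributed_rebuild_hours(10, 100, 100, 8):.1f} h")
```

With these assumed numbers the model gives roughly 28 hours for the classic rebuild versus about 2.5 hours for the distributed one, which is at least in the same ballpark as the vendor claims above.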
As opposed to RAID 5, where your array is toast if any two disks fail, RAID 6 increases this to three.
However both raid 5 and 6 have 2 huge problems:
Data inflight at write time (power/hardware failures are more likely to corrupt the array, especially silently, which is the worst outcome).
Parity calculations require you to spin up the whole raid5/6 array during a rebuild, massively increasing the chance of a multi drive failure and a lost array. If one close-to-EOL drive dies, putting its sister drives through what is essentially an all day full tilt stress test is a terrible, terrible idea, and this idea keeps getting worse (takes longer) as drive sizes grow.
RAID 0+1 mostly sidesteps these issues at a modest increase in drive count; it's a no-brainer for most setups.
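For anyone wondering why a parity rebuild has to touch every surviving drive, here is a minimal sketch of single-parity (RAID 5 style) reconstruction in Python. It is a toy: one stripe, tiny blocks, no parity rotation, and the helper names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild_member(stripe, failed_index):
    """Reconstruct the failed member. Note that every surviving member of the
    stripe has to be read; there is no shortcut that skips a drive."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
    print(rebuild_member(stripe, 1) == stripe[1])
```

A mirror rebuild, by contrast, only needs to read the one surviving copy of the failed drive, which is the difference the comment above is pointing at.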
>Data inflight at write time (power/hardware failures are more likely to corrupt the array, especially silently, which is the worst outcome).
How is that? RAID doesn't affect data persistence behavior in any meaningful way. FUA/SyncCache/etc are supported by RAID controllers the same as by the underlying disks in writeback environments, parity updates included. Put another way, if you FUA or flush the writeback cache, those operations won't complete in a properly implemented RAID environment until the data is persisted somewhere, even if that means passing FUA down to the underlying storage. Granted, there are a number of ways to mess this up, such as RMW cycles in a controller that doesn't have some kind of persistent memory and a flush on power restore. Anyway, none of this is any worse than what happens in any other WB cached storage technology.
Finally, all this fearmongering about loss on rebuild should be weighed against the fact that decent RAID systems run background scrub operations on a regular basis. Those operations by themselves are going to "stress test" the array regularly while it's consistent and not degraded. I've actually got a fair amount of experience in this area, and I'm here to tell you that if you think this is a risk, consider what happens to non-RAIDed, unscrubbed drives that have a lot of data silently bitrotting on the platters. That latter effect is nearly always the problem in RAID environments when someone starts a rebuild on drives/sectors that have gone unread for extended periods of time. But, in the case of RAID, a properly implemented system won't fail a drive for a single read failure during a rebuild, instead reconstructing from the other drives and leaving the drive online long enough to complete the rebuild and then taking it offline.
Basically, RAID 1 setups don't actually fix any of these problems, except through massive additional disk overhead. That overhead can also be applied to other RAID algorithms to much better effect. For example, a mirrored RAID 6 provides far more protection than a mirrored RAID 0. Similar levels can be had with 6+6 in environments where that is possible, with trivial capacity overhead.
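Since protection levels keep coming up in this thread, here is a small brute-force sketch in Python that counts which multi-drive failures each layout survives. The layouts are naive textbook models on 8 drives (mirror pairs for RAID 10, two mirrored 4-drive stripes for RAID 0+1, any-two-failures for RAID 6), and all function names are made up for illustration; real implementations and real failure correlations will differ.

```python
from itertools import combinations


def raid10_survives(failed, n):
    """Mirror pairs are (0,1), (2,3), ...; fatal only if a whole pair is gone."""
    return not any({a, a + 1} <= failed for a in range(0, n, 2))


def raid01_survives(failed, n):
    """Two striped halves, mirrored; survives while one half is fully intact."""
    half = n // 2
    left_hit = any(d < half for d in failed)
    right_hit = any(d >= half for d in failed)
    return not (left_hit and right_hit)


def raid6_survives(failed, n):
    """Dual parity tolerates any two failed members, and no more."""
    return len(failed) <= 2


def survival_count(survives, n, k):
    """Return (surviving combinations, total combinations) for k failed drives."""
    combos = [set(c) for c in combinations(range(n), k)]
    ok = sum(1 for failed in combos if survives(failed, n))
    return ok, len(combos)


if __name__ == "__main__":
    for k in (2, 3):
        for name, fn in (("RAID 10", raid10_survives),
                         ("RAID 0+1", raid01_survives),
                         ("RAID 6", raid6_survives)):
            ok, total = survival_count(fn, 8, k)
            print(f"{k} failures, {name}: {ok}/{total} survive")
```

Under this model, with two failed drives RAID 10 survives 24 of 28 combinations, RAID 0+1 survives 12 of 28, and RAID 6 survives all 28. With three failures it flips: RAID 10 survives 32 of 56, RAID 0+1 survives 8 of 56, and RAID 6 survives none. Note that RAID 6 does this with 6 drives of usable capacity out of 8, versus 4 for the mirrored layouts.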
Raid 5/6 require parity calculations before data can be written to disk. This is a significant amount of data, especially at high writing speeds. That is what causes the inflight data problem.
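A minimal Python sketch of the read-modify-write parity update being described, for single XOR parity only (RAID 6's second syndrome is more involved and not shown). The helper names are made up; the point is just that a small write becomes two dependent writes (data and parity), and a crash between them leaves the stripe inconsistent.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(blocks):
    """XOR parity computed over every data block in the stripe."""
    parity = bytes(len(blocks[0]))  # all-zero block of the right length
    for blk in blocks:
        parity = xor_bytes(parity, blk)
    return parity


def rmw_parity(old_data, new_data, old_parity):
    """Small-write shortcut: new_parity = old_parity XOR old_data XOR new_data."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


if __name__ == "__main__":
    stripe = [b"AAAA", b"BBBB", b"CCCC"]
    parity = full_parity(stripe)

    new_block = b"ZZZZ"
    new_parity = rmw_parity(stripe[1], new_block, parity)
    stripe[1] = new_block  # data write lands...

    # ...and if the parity write does not, the stale parity no longer matches.
    print("stale parity consistent:", full_parity(stripe) == parity)
    print("new parity consistent:  ", full_parity(stripe) == new_parity)
```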
Battery and flash backup on controllers doesn't fix the problem of hardware failure (which is significant, especially on big hot controllers).
Again, decent controllers have ECC protection and the like, and frequently are available in HA configurations if your worry is controller failure (along with redundant/dual data paths to the media via SAS/NVMe/etc). Plus, there is a long list of technologies that can be enabled at the HBA layer and pushed all the way to the media (T10 DIF/DIX comes to mind).
But much of this micro-level redundancy is overkill, as frequently one uses some kind of application-level HA/redundancy as well. So, loss of a RAID5/6 disk in a single machine is the functional equivalent of loss of any combination of RAID 0/1 in the same machine. You still need the higher-level redundancy as well as a backup plan.
We could start breaking the discussion up into fabric attached vs direct attach RAID vs software, but I think it's sufficient to say that RAID5/6 doesn't _increase_ the failure surface in any meaningful way when you're not using fly-by-night RAID.
Edit: Maybe what you're trying to say is that cache flush/FUA operations for a given piece of data don't cover the parity calculation and buffers? That is false; a controller should not be responding to FUA/etc until the entire block (including the parity) has been persisted. So if the controller dies during the operation, the host OS is fully aware that the operation didn't complete. The given block is of course left in some unknown state in this case, but that is true of any write operation that fails like this, regardless of WT/WB/RAID/etc.
The biggest problem with RAID 5 is that it is completely unprotected against silent corruption -- because there is no way for the RAID layer to know which block is the corrupted one (so it has to guess whether the data or the parity is correct -- though most RAID implementations just ignore silent corruption completely, so the parity is always assumed to be the wrong one in such cases).
So even if you rebuild an array, a bad drive might've blown away all of your data already. If you were to compare this with ZFS' "raid" Z1 (same parity, different design) you get detection and protection against silent data corruption.
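A toy Python sketch of that difference: plain XOR parity can tell you a stripe is inconsistent but not which member is wrong, while a per-block checksum stored separately pinpoints the bad block so parity can repair it. This is only an illustration of the principle, not ZFS's actual on-disk design; CRC32 is a stand-in checksum and the function names are made up.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_mismatch(data_blocks, parity):
    """All plain parity can say: 'something in this stripe is wrong'."""
    return xor_blocks(data_blocks) != parity


def find_bad_blocks(data_blocks, checksums):
    """With independent checksums we learn *which* block is wrong."""
    return [i for i, blk in enumerate(data_blocks)
            if zlib.crc32(blk) != checksums[i]]


def repair(data_blocks, parity, bad_index):
    """Rebuild the identified block from the other blocks plus parity."""
    others = [b for i, b in enumerate(data_blocks) if i != bad_index]
    return xor_blocks(others + [parity])


if __name__ == "__main__":
    data = [b"aaaa", b"bbbb", b"cccc"]
    parity = xor_blocks(data)
    checksums = [zlib.crc32(b) for b in data]

    data[1] = b"bXbb"  # silent corruption: the drive returns bad data, no error

    print("parity mismatch:", parity_mismatch(data, parity))
    bad = find_bad_blocks(data, checksums)
    print("bad blocks:", bad)
    print("repaired:", repair(data, parity, bad[0]))
```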
>through what is essentially an all day full tilt stress test is a terrible, terrible idea
The rebuild isn't putting the disks under stress. The sister drive has already failed silently but you only notice this once you start the rebuild. The solution is to check the disks once a week by fully reading every sector.
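A bare-bones Python sketch of "read every sector and see what fails". In practice you would use the scrub or check facility built into your RAID layer or filesystem rather than a script like this. The function name is made up, and for a real device the read could be served from the page cache unless you bypass it, which this sketch does not attempt.

```python
def read_scrub(path, chunk_size=1 << 20):
    """Read `path` from start to end in fixed-size chunks.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the starting
    offset of each chunk whose read raised an I/O error.
    """
    bad_offsets = []
    bytes_read = 0
    offset = 0
    with open(path, "rb", buffering=0) as dev:
        while True:
            try:
                chunk = dev.read(chunk_size)
            except OSError:
                # Record the failing chunk and skip past it.
                bad_offsets.append(offset)
                offset += chunk_size
                dev.seek(offset)
                continue
            if not chunk:
                break  # end of file/device
            bytes_read += len(chunk)
            offset += len(chunk)
    return bytes_read, bad_offsets
```

Run from a weekly timer against each member drive (the device path is whatever your system uses), any non-empty `bad_offsets` is your early warning that a drive has gone bad before a rebuild finds it for you.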
The normal answer here is to make sure that each side of the RAID10 (RAID01 is something different and much less common) mirror uses drives from a different vendor, thus giving each side a different bathtub curve / failure rate and mitigating the impact of a bad batch. This is a nice advantage over parity-based setups like RAID6 (since replicating this with RAID6 would require finding a unique vendor for each array member, and there are only so many vendors).
For archival purposes, though, you're probably better off with a normal RAID1 + some kind of JBOD setup (like with LVM); striping makes data recovery more difficult should you indeed lose all RAID1 sides of a given member.
You can upgrade a 2-disk RAID1 to a 3-disk RAID5, then chain them to RAID0 as normal. It gives you a better chance to keep data intact, hopefully without lowering the write speed seriously.