Hacker News new | ask | show | jobs
by mililani 4292 days ago
This may be a little off topic, but I used to think RAID 5 and RAID 6 were the best RAID configs to use. It seemed to offer the best bang for buck. However, after seeing how long it took to rebuild an array after a drive failed (over 3 days), I'm much more hesitant to use those RAIDS. I much rather prefer RAID 1+0 even though the overall cost is nearly double that of RAID 5. It's much faster, and there is no rebuild process if the RAID controller is smart enough. You just swap failed drives, and the RAID controller automatically utilizes the back up drive and then mirrors onto the new drive. Just much faster and much less prone to multiple drive failures killing the entire RAID.
7 comments

This can not be stressed strongly enough. There is never a case when RAID5 is the best choice, ever [1]. There are cases where RAID0 is mathematically proven more reliable than RAID5 [2]. RAID5 should never be used for anything where you value keeping your data. I am not exaggerating when I say that very often, your data is safer on a single hard drive than it is on a RAID5 array. Please let that sink in.

The problem is that once a drive fails, during the rebuild, if any of the surviving drives experience an unrecoverable read error (URE), the entire array will fail. On consumer-grade SATA drives that have a URE rate of 1 in 10^14, that means if the data on the surviving drives totals 12TB, the probability of the array failing rebuild is close to 100%. Enterprise SAS drives are typically rated 1 URE in 10^15, so you improve your chances ten-fold. Still an avoidable risk.

RAID6 suffers from the same fundamental flaw as RAID5, but the probability of complete array failure is pushed back one level, making RAID6 with enterprise SAS drives possibly acceptable in some cases, for now (until hard drive capacities get larger).

I no longer use parity RAID. Always RAID10 [3]. If a customer insists on RAID5, I tell them they can hire someone else, and I am prepared to walk away.

I haven't even touched on the ridiculous cases where it takes RAID5 arrays weeks or months to rebuild, while an entire company limps inefficiently along. When productivity suffers company-wide, the decision makers wish they had paid the tiny price for a few extra disks to do RAID10.

In the article, he has 12x 4TB drives. Once two drives failed, assuming he is using enterprise drives (Dell calls them "near-line SAS", just an enterprise SATA), there is a 33% chance the entire array fails if he tries to rebuild. If the drives are plain SATA, there is almost no chance the array completes a rebuild.

[1] http://www.smbitjournal.com/2012/11/choosing-a-raid-level-by...

[2] http://www.smbitjournal.com/2012/05/when-no-redundancy-is-mo...

[3] http://www.smbitjournal.com/2012/11/one-big-raid-10-a-new-st...

Note that the 10^14 figure is only what the HDD mfgs publish, and it has been the same for something like a decade. It's a nice, safe, conservative figure that seems impressively high to IT Directors and accountants, and yet it's low enough that HDD mfgs can easily fall back on it as a reason why a drive failed to meet a customers expectations.

In reality you'll find that drives perform significantly better than that, arguably orders of magnitude better.

That said, I'm still not a fan of RAID5. Both rebuild speed and probability of batch failures of drives (if one drive fails, the probability of another drive failing is significantly higher if they came from the same batch), make it too risky a prospect for my comfort.

10^14 bit error rate is false or routine ZFS scrubs would produce documented read errors. I'm inclined to believe the entire math here is wrong as well.
What I don't understand about the URE is from [2]. If you have a 12TB raid 5 array and you need to rebuild. If 10^14 approaches an URE at around 12TB of data as the article says. What causes it to hit 12TB? Each disk has 10^14 which is 100TB. If you had 12TB from 4x3TB disks it should have alot to go through.
10^14 is bits. When you divide 10^14 by 8 you get 12.5 trillion bytes, or 12.5 TB.

If you have 4x 4TB drives in RAID5, and one fails, then in order to rebuild with a replacement drive, you have to read all data from all surviving drives (3 x 4TB = 12TB).

Here is an example from a manufacturer:

http://www.seagate.com/files/www-content/product-content/con...

They call it "Non-recoverable Read Errors per Bits Read" and then list "1 sector per 10^15". So for every 10^15 bits read they expect 1 sector to be unreadable.

OK i see now but if it needs to read 12TB from all drives together then that's still far from 12.5TB per drive which is the limit. That's where I am confused.
Say we have a department of a company with 36 employees, and one pair of dice. We decide that if any person out of the entire department rolls a 12, then everyone in the department will be fired. The chance of rolling a 12 is 1/36. It doesn't matter if one person keeps rolling the dice, or if they take turns, the chances of everyone being fired are close to 100%.

The same is true for a disk array. Each read operation is an independent event (for the purpose of doing this math). The chance of one URE happening is 1/(10^14) for every bit read. It doesn't matter which disk it happens on. When it happens, the entire array is failed.

Also 12.5 TB is not a hard limit, just an average. The URE could happen on the very first read operation, or you might read 100 TB without a URE.

I think your calculation on failing an array rebuild is wrong. Can you show how you got those numbers?
Sure, there were two statements I made.

>On consumer-grade SATA drives that have a URE rate of 1 in 10^14, that means if the data on the surviving drives totals 12TB, the probability of the array failing rebuild is close to 100%.

10^14 bits is 12.5 TB, so on average, the chance of 12TB being read without a single URE is very low, and the probability the array fails to rebuild is close to 100%. I was estimating 10^14 bits to be about 12TB, so the probability is actually 12/12.5 = 96% chance of failure.

>...he has 12x 4TB drives. Once two drives failed, assuming he is using enterprise drives...there is a 33% chance the entire array fails if he tries to rebuild. If the drives are plain SATA, there is almost no chance the array completes a rebuild.

A RAID6 with two failed drives is effectively the same situation as a RAID5 with one failed drive. In order to rebuild one failed drive, the RAID controller must read all data from every surviving drive to recreate the failed drive. In this case, there are 10x 4TB surviving drives, meaning 40TB of data must be read to rebuild. Because these drives are presumably enterprise quality, I am assuming they are rated to fail reading one sector for every 10^15 bits read (10^15 bits = 125 TB). So it's actually 40/125 = 32% chance of failure if you try to rebuild.

There is even a community for it: BAARF - http://www.miracleas.com/BAARF/

The main reason is not only the rebuild time (which indeed is horrible for a loaded system), but also the performance characteristics that can be terrible if you do a lot of writing.

We for example more than doubled the throughput of Cassandra by dropping RAID5. So the saving of 2 disks per server in the end required buying almost twice as many servers.

Wouldn't rsync of been a better and more reliable choice for this?
The problem with rsync is that it would have to traverse all the FS tree and check every single file on both sides for the timestamp and the file. In a relatively small FS tree is just fine, but when you start having GBs and GBs and tens of thousands of files, it becomes somewhat impractical.

Then, you need to come up with other creative solutions, like deep syncing inside the FS tree, etc. Fun times.

Add checksumming, and you probably would like to take a holiday whilst it copies the data :-)

if you want to copy everything and there's nothing at the target:

rsync --whole-file --ignore-times

that should turn off the metadata checks and the rsync block checksum algorithm entirely and transfer all of the bits at the source to the dest without any rsync CPU penalty.

also for this purpose it looks like -H is also required to preserve hard links which the man page notes:

"Note that -a does not preserve hardlinks, because finding multiply-linked files is expensive. You must separately specify -H."

be mildly interesting to see a speed test between rsync with these options and cp.

there are also utilities out there to "post-process" and re-hardlink everything that is identical, so that a fast copy without preserving hardlinks and then a slow de-duplication step would get you to the same endpoint, but at an obvious expense.

I use rsync to replicate 10TB+ volumes without any problems. It's also very fast to catch up on repeat runs, so you can ^C it without being too paranoid.
It sounds like OP had large hard-link counts, since he was using rsnapshot. It's likely that he simply wouldn't have had the space to copy everything over to the new storage without hard-links.
You cannot exaggerate what rsync can do... I use it to backup my almost full 120gb Kubuntu system disk daily, and it goes through 700k files in ~ 2-3 minutes. Oh, and it does it live, while I'm working.
To be fair: the case described here is two magnitudes larger in both directions than your use case.

And the critical aspect from a performance perspective was where the hash table became too large to fit into memory. Performance of all sorts goes pear-shaped when that happens.

rsync is pretty incredible, but, well, quantity has a quality all its own, and the scale involved here (plus the possible in-process disk failure) likely wasn't helping much.

Yep, my answer when cp or scp is not working well is always to break out rsync, even if I'm just copying files over to an empty location.

I've had good luck with the -H option, though it is slower than without the option. I have never copied a filesystem with nearly as many files as the OP with -H; the most I've done is probably a couple million, consisting of 20 or so hardlinked snapshots with probably 100,000 or 200,000 files each. rsync -H works fine for that kind of workload.

> In a relatively small FS tree is just fine, but when you start having GBs and GBs and tens of thousands of files, it becomes somewhat impractical.

I regularly rsync machines with millions of files. On a slow portable 2.5" 25MB/sec USB2 connection it's never taken me more than 1hr on completely cold caches to verify that no file needs to be copied. With caches being hot, it's a few minutes. And on faster drives it's faster still.

Unless you are doing something weird and force it to checksum, checksumming mostly kicks in on files that have actually changed and can be transferred efficiently in parts (i.e. network; NOT on local disks). In other cases, it's just a straight copy.

Have you actually used rsync in the setting you describe, or at least read its manual?

> The problem with rsync is that it would have to traverse all the FS tree and check every single file on both sides for the timestamp and the file.

- In this scenario, the receiving side is empty, so there is no need to check every single file.

- Since 3.0, rsync walks the dirtree while copying (unless you use some special options). So, in some sense, rsync is already "deep syncing inside the FS tree", as you put it.

rsync since version 3 has stopped doing the full tree traversal It's not in RHEL/CentOS 5 but is in 6, and is really trivial to compile by hand anyway. That's made life a lot easier for me in the past.

Some of the other features in rsync would seem to make it a more logical fit for this task anyway, e.g. it's resumable, it supports on-the-fly compression (though the value of that depends on the types of files being transferred), native logging etc etc.

WE use 10 rsyncs in parallel to copy .5pb in less than 3 days.

CPU is cheap.

Have done the same, found that rsync seems to not natively parallelize itself, so spread across 20 cores, it really screamed.
Damn, I could use some advice there. Right now I have a system which uses rsync to backup some files. It takes about 8 hours to go over 7+ million files (~3.5TB). I'd love to speed up this. I should mention that copy is done over curlftpfs :(.
s/of/have
aint' nobody got time fo dat!
I had a RAID 6 have a failed disk a couple weeks ago... 8 x 3TB drives... 18TB usable. Took 7 hours to rebuild.

Disks were connected via SATA 3 and was using mdadm

Assuming it is reading and writing almost simultaneously that is ~125Mbyte/sec on each physical drive, which isn't bad considering the latency of the read-from-three-or-four-then-checksum-then-write-to-the-other-one-or-two process if you if the array was was in use at the time.

There are a few ways to improve RAID rebuild time (see http://www.cyberciti.biz/tips/linux-raid-increase-resync-reb... for example) but be aware that they tend to negatively affect performance of other use of the array while the build is happening (that is, you are essentially telling the kernel to give more priority to the rebuild process than it normally would when there are other processes accessing the data too).

In my experience working with Debian, the rebuild is sensitive to disk usage. If you start using the disks, the rebuild will slow down intentionally so as to not choke the drives. This could explain why it was fast for you but not for the parent.

Edit: To clarify, I was using mdadm, not hardware raid.

Yeah, I question that rebuild. If the array is being used along with the rebuild, typically the rebuild process gets put to a lower priority. I've seen IBM enterprise RAID 5 storage arrays take forever to rebuild if they were the main storage for a high transaction database.
I you can tune the speed when the disks are in use; I usually bump it to around 60MB/s as long as there isn't a live database on the disk; the default is something insanely slow, and it impacts read/writes almost as much as the higher value.
How does the rebuild time of ZFS's raidz compare to RAID5/6?
My largest ZFS pool is currently ~64TB ( 3 X 10 3TB (raidz2) ) The pool has ranged from 85%-95% full (it's mostly at 85% now and used mostly for reads).

Resilvering one drive usually takes < 48 hours. Last time took ~32 hours.

Something else cool: When I was planning this out I wrote a little script to calculate the chance of failure. With a drive APF of 10% (which is pretty high), and a rebuild speed of 50MB/sec (very low compared to what I typically get) I have a 1% chance of losing the entire pool over a 46.6 year period. If I add 4 more raidz2 10X3TB VDEVS that would drop to 3.75 years.

What is APF? And why not use the typical URRE rate to calculate your stats?

I'm always mystified at how stupid our storage systems are. Even very expensive SAN solutions from EMC and the like area just... stupid. We've got loads of metrics on every drive, but figuring out that those things should be aggregated and subjected to statistical analysis just seems to have not been done yet.

What I really want is a "pasture" system - a place I can stick old drives of totally random sizes and performance characteristics and have the system figure out where to put the data in order to maintain a specific level of reliability. Preferrably backed by an online database that tracks the drive failure rate of every drive on every deployed system, noting patterns in 'bad batches' of certain models and the like. If one of my drives would have to beat 3-standard-deviations odds to survive for the next week, move the damn data to somewhere better. And if you've got 2 150GB drives and 1 300GB drive, then each block on those drives has a rating of 2.0 - adjusted for the age and metrics of the drive.

Oh well, maybe when I retire in 30 years storage systems will still be as stupid as they've remained for the past 30 years and I'll have another project to add to the pile I don't have time to work on now.

btrfs at least offers the flexibility to use a collection of mismatched drives and ensure that N copies of the data exist across separate devices. Once it's parity-based RAID modes are stable, it should be able to retain that flexibility while offering a redundant option for cases where N=2 wastes too much space. That combined with regular scrubbing should suffice for moderately sized arrays.

Treating drives differently based on their expected failure rates seems like it would only matter for very large arrays trying to keep the redundancy as low as possible.

I have a 24TB RAIDZ3 array. Losing a disk out of that takes about 12-18 hours to rebuild.
It really depends on the load and if your raidz has 1,2, or 3 spare drives. From what I've experienced, re-slivering a the new drive in a raidz seems to happen at about 30% of the maximum ZFS throughput under medium load. And even better than many RAID5/6 implementations - it only re-slivers enough to store your data and not just the entire drive.
It is generally superior because it can skip blank blocks. RAID is byte-by-byte.
If you want to burn even more disks, 6+0 or 5+0 can be a nice way to cut down on the worst car senerio of losing data.
in my opinion, raid 10 is by far the best if you have to support the operations of a production site in a real world environment. just keep spare controllers and drives handy. it's just a matter of swapping out failures.

three cost levers: capacity, speed, and downtime.

basically, at some point you will begin to realize the cheapest cost is the capacity itself, everything else is an order of magnitude more difficult to justify. downtime and slowness are simply unacceptable, whereas buying more stuff is perfectly acceptable and quite feasible solution in an enterprise environment.

Exactly.

I work with a number of medium sized businesses of which you can clearly tell who have had and who havent had major downtime due to these types of hardware failures.

Many people who have never experienced issues have no interest in even forming a basic DR plan, but the ones who have just start throwing money at you the moment you mention it.

Buying more stuff is WAY cheaper than slowness and downtime, and penny pinching on this count is going to cause so much additional heartache for what will in the end cost more anyway.