by andy4blaze 2131 days ago
Andy from Backblaze here. Nice thinking about the bulk ordering and considerations for RAID. All things we have considered. We use our own Reed-Solomon encoding with a 17/3 set-up (17 data shards plus 3 parity shards) spread across 20 drives in 20 different systems; we call that a Tome. Then we have a specific protocol we follow as drives fail in a Tome to protect the data at all costs. We have the luxury, for example, of being able to stop writing to a given Tome, as we have plenty of others available. This takes a lot of the stress off the system. Your thoughts on bulk buys and bad drive batches/models are solid. We test drives in small batches first, and we track drive failures so we don't get to the point of hitting the wall. It would be great to mix and match drives, but you end up with a system that maxes out at the speed of the least performant drive, so it's not optimal.
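For readers who want the 17/3 numbers spelled out, here is a minimal sketch of the arithmetic. It is not Backblaze's code; the function and key names are made up for illustration, and the only inputs are the 17 and 3 from the comment above. It relies on the fact that a Reed-Solomon code can recover from the loss of any shards up to the number of parity shards.

```python
# Back-of-the-envelope numbers for a 17 data + 3 parity layout.
# All names here are hypothetical; only the 17 and 3 come from the comment.
DATA_SHARDS = 17
PARITY_SHARDS = 3

def tome_stats(data_shards: int, parity_shards: int) -> dict:
    total = data_shards + parity_shards
    return {
        "drives": total,
        # Reed-Solomon survives the loss of any `parity_shards` shards.
        "failures_tolerated": parity_shards,
        "usable_fraction": data_shards / total,
        "raw_bytes_per_stored_byte": total / data_shards,
    }

stats = tome_stats(DATA_SHARDS, PARITY_SHARDS)
for key, value in stats.items():
    print(f"{key}: {value}")
```

So a Tome uses 20 drives, keeps 85% of raw capacity for data, and stays readable with any 3 of its drives gone.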
3 comments

Do you power cycle the disks during regular operations?

There is a failure mode in disks that can be modelled like this: something fails on the drive, but it continues to work fine for months or even years until the next power cycle, after which it won't work at all.

Obviously that's a problem for redundancy schemes because you think you have plenty of redundancy till there is a power outage and suddenly loads of drives fail at once.
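To put a rough shape on that worry for a 20-drive group with 3 parity drives, here is a sketch under a strong assumption: each drive independently carries a hidden "dies at next power cycle" fault with probability p. The p values are hypothetical, not measured, and real faults from one bad batch would be correlated, which makes things worse than this model says.

```python
from math import comb

def prob_more_than(k: int, n: int, p: float) -> float:
    """P(more than k of n drives fail), each independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))

# Hypothetical latent-fault rates, purely for illustration.
for p in (0.01, 0.05, 0.10):
    risk = prob_more_than(3, 20, p)
    print(f"p={p:.2f}  P(more than 3 of 20 fail together) = {risk:.2e}")
```

The point is only that the chance of exceeding the parity count climbs steeply with p, so a latent-fault rate that looks small per drive can still matter once every drive is power cycled at the same moment.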

I have never seen any of your reports measuring or reporting on these 'fail after power cycle' events, which is surprising.

I recall reading about your triple redundancy with R-S codes, but it's good to restate it for each audience.

From the behavior of my RAID (which also uses Reed Solomon, doesn't it?) it feels like repairing an array takes time proportional to the size of the drive, not the size of the contents, and it feels like a waste. But it's possible that my comfort levels for available disk space are a lot more conservative than other people's, and so the difference is less pronounced in a 'normal' storage situation.

For instance, an array that's at 80% capacity takes 25% longer to rebuild than I wish it would, whereas an array that's at 66% capacity takes 50% longer.
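Those percentages follow from a simple ratio, sketched below under the assumption that rebuild time is proportional to the number of bytes copied. The function name is made up for illustration.

```python
def extra_rebuild_time(fill_fraction: float) -> float:
    """Fractional extra time of a whole-drive rebuild versus one
    that copies only the used space, assuming time scales with bytes copied."""
    if not 0 < fill_fraction <= 1:
        raise ValueError("fill_fraction must be in (0, 1]")
    return 1 / fill_fraction - 1

for fill in (0.80, 2 / 3):
    print(f"{fill:.0%} full -> {extra_rebuild_time(fill):.0%} longer")
```

At 80% full the whole-drive rebuild copies 1/0.8 = 1.25 times the data, and at two-thirds full it copies 1.5 times the data, which gives the 25% and 50% figures.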

> it feels like repairing an array takes time proportional to the size of the drive, not the size of the contents

This is only true for drive-level RAID, not for filesystem-level RAID or a non-RAID solution like Ceph's replication.

ZFS's filesystem raid can repair a raid in time proportional to the amount of data stored in it.

mdadm and raid controllers aren't aware of which parts of the block device are in use or not, and thus have to repair the whole drive.

It's exceedingly likely that Backblaze's solution does not require repairing entire block devices, and is instead closer to Ceph, where only the in-use portion of a failed drive must be considered / must find a new home.

I think RAID and distributed storage systems (like Backblaze or Ceph) are more different than they are alike.

> From the behavior of my RAID (which also uses Reed Solomon, doesn't it?)

Maybe. mdadm raid5 doesn't, nor does mdadm raid1 or raid10. I think mdadm's raid6 does.
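For context, single-parity RAID 5 needs nothing more than XOR, which is why it isn't usually described as Reed-Solomon. As far as I know, Linux md's RAID 6 keeps that XOR parity and adds a second syndrome computed over GF(2^8), which is where the Reed-Solomon math comes in. A minimal sketch of the XOR case, with made-up block contents and not taken from mdadm:

```python
def xor_parity(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(parity):
            raise ValueError("all blocks must be the same length")
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks in one stripe
parity = xor_parity(data)            # the single parity block for that stripe

# Lose one data block, then rebuild it from the survivors plus parity.
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost]
rebuilt = xor_parity(survivors + [parity])
assert rebuilt == data[lost]
```

XOR can only recover one missing block per stripe; surviving two or more losses is what needs the extra, independently computed parity.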

I'm interested that you don't use nested codes...

You could then, for example, have 18/2, and then group together 400 drives in a 2nd layer of 19/1. Hey, I reckon you could do 19/1, 39/1, reducing your storage costs 7.5%...

Sure, the worst-case rebuild cost is much worse, but overall data loss probability is far lower, and a 2nd layer rebuild is a very rare event, and in that case, your customers totally prefer a few extra seconds of latency over an email reporting their data is lost...

I assume you mostly do streaming rather than random writes, so the overhead is evenly spread amongst the disks, and is the same 15% as your current scheme.
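The overhead figures for these nested layouts can be checked with a few lines. This is a sketch that assumes each layer is a plain (data, parity) split and that the usable fractions of the layers simply multiply; the function name is made up for illustration.

```python
def usable_fraction(*layers: tuple[int, int]) -> float:
    """Usable fraction of raw capacity for nested (data, parity) layers."""
    frac = 1.0
    for data, parity in layers:
        frac *= data / (data + parity)
    return frac

current = usable_fraction((17, 3))                 # the existing 17/3 scheme
first_example = usable_fraction((18, 2), (19, 1))  # 18/2 inside a 19/1 layer
nested = usable_fraction((19, 1), (39, 1))         # 19/1 inside a 39/1 layer

print(f"17/3 usable:         {current:.3%}")
print(f"18/2 + 19/1 usable:  {first_example:.3%}")
print(f"19/1 + 39/1 usable:  {nested:.3%}")
print(f"extra usable share of raw capacity: {nested - current:.3%}")
print(f"raw drives saved for the same data: {1 - current / nested:.3%}")
```

Under those assumptions, 18/2 with a 19/1 layer keeps 85.5% of raw capacity, so its overhead is close to the current 15%. The 19/1, 39/1 layout keeps about 92.6%, which is roughly 7.6 percentage points more usable space, or about 8% fewer raw drives for the same data.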