by andy4blaze
2131 days ago
Andy from Backblaze here. Nice thinking about the bulk ordering and considerations for RAID. These are all things we have considered. We use our own Reed-Solomon encoding with a 17/3 setup (17 data shards plus 3 parity shards) spread across 20 drives in 20 different systems; we call that a Tome. We then follow a specific protocol as drives fail in a Tome, to protect the data at all costs. For example, we have the luxury of being able to stop writing to a given Tome, as we have plenty of others available. This takes a lot of the stress off the system.
Your thoughts on bulk buys and bad drive batches/models are solid. We test drives in small batches first, and we track drive failures so we don't get to the point of hitting the wall. It would be great to mix and match drives, but you end up with a system that maxes out at the speed of the least performant drive. So not optimal.
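
To put rough numbers on the 17/3 layout described above, here is a minimal sketch. It assumes each of the 20 drives in a Tome fails independently with the same probability over some window, which is a simplification and not a claim about how Backblaze models risk. The function names and the example probability are hypothetical, chosen only for illustration.

```python
from math import comb

# Layout taken from the comment above: 17 data shards + 3 parity shards.
DATA_SHARDS = 17
PARITY_SHARDS = 3
TOTAL_SHARDS = DATA_SHARDS + PARITY_SHARDS  # 20 drives per Tome


def storage_overhead() -> float:
    """Raw bytes stored per byte of user data."""
    return TOTAL_SHARDS / DATA_SHARDS


def prob_data_loss(p_drive_fail: float) -> float:
    """Probability that more than PARITY_SHARDS of the TOTAL_SHARDS drives
    fail, assuming independent failures with probability p_drive_fail each
    (an illustrative assumption)."""
    return sum(
        comb(TOTAL_SHARDS, k)
        * p_drive_fail ** k
        * (1 - p_drive_fail) ** (TOTAL_SHARDS - k)
        for k in range(PARITY_SHARDS + 1, TOTAL_SHARDS + 1)
    )


if __name__ == "__main__":
    print(f"storage overhead: {storage_overhead():.3f}x")
    # 0.01 is a made-up per-drive failure probability for the window.
    print(f"P(loss) at p=0.01: {prob_data_loss(0.01):.2e}")
```

The point of the sketch is only that the Tome survives any 3 simultaneous drive losses at about 1.18x raw storage, and that the risk comes from a fourth failure arriving before repairs finish.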
|
There is a failure mode in disks that can be modelled like this: something fails on the drive, but it continues to work fine for maybe months or years until the next power cycle, after which it won't work.
Obviously that's a problem for redundancy schemes, because you think you have plenty of redundancy until there is a power outage and suddenly loads of drives fail at once.
I have never seen any of your reports measure or report on these 'fail after power cycle' events, which is surprising.
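
A rough sketch of why this matters for a 20-drive group that tolerates 3 failures: if each drive has some small monthly chance of developing a hidden fault that only shows up at the next power cycle, the hidden faults pile up with uptime. The monthly rate, the function names, and the assumption that all 20 drives are power cycled together (a worst case, such as a site-wide outage) are all hypothetical and chosen for illustration only.

```python
from math import comb


def latent_fault_prob(monthly_rate: float, months_since_power_cycle: int) -> float:
    """Chance that one drive carries a hidden fault after running this many
    months without a power cycle, assuming a constant monthly rate."""
    return 1 - (1 - monthly_rate) ** months_since_power_cycle


def prob_too_many_fail_on_restart(
    monthly_rate: float,
    months_since_power_cycle: int,
    total_drives: int = 20,
    tolerated_failures: int = 3,
) -> float:
    """Chance that more drives than the scheme tolerates fail to come back
    when every drive in the group is power cycled at the same time."""
    q = latent_fault_prob(monthly_rate, months_since_power_cycle)
    return sum(
        comb(total_drives, k) * q ** k * (1 - q) ** (total_drives - k)
        for k in range(tolerated_failures + 1, total_drives + 1)
    )


if __name__ == "__main__":
    # 0.002 per month is a made-up rate, not a measured one.
    for months in (1, 12, 36):
        risk = prob_too_many_fail_on_restart(0.002, months)
        print(f"{months:>2} months of uptime: {risk:.2e}")
```

Under these assumptions the risk grows quickly with uptime, because every drive in the group accumulates hidden faults over the same period and they all surface at once.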