by brianwski
2230 days ago
Disclaimer: I work at Backblaze.

> Our failure rates were consistently much higher compared to Backblaze

That's interesting.

> The overall workload on the clusters was extremely heavy

That is the most likely explanation. We see higher failure rates when we are writing to the drives, for example as a new vault fills with customer data. Backup (our oldest business) has a fairly easy workload; it is not as punishing as a database that is pummeling the drives.

In certain circumstances, when we are down more than 1 drive in a 20-drive Reed-Solomon group, we stop putting new data on that drive group until the parity is restored, explicitly because this lowers the chances of an additional drive failing in the group. That gives us more time to rebuild the parity with less stress in our lives.

When that last parity drive fails and one more drive failure means customers lose data, the fun drains right out of this job. Red alerts are thrown, pagers go off in the middle of the night, and datacenter techs start driving towards the datacenter at 3am to replace drives. We prefer the world nice and calm and relaxed with a good night's sleep.
/me goes and checks Backblaze's careers page...
(I suspect driving time to any of their datacenters from Sydney puts me out of the running... At least their 3am emergencies would be a much more reasonable 8pm emergency from here.)