| HN Mirror


	by hobojones 2227 days ago
	I concur - I used to work with a set of clusters totaling ~45k drives and it was always fun to compare failure rates (especially on a per model basis).

icefo 2227 days ago

Did you sometimes end up with significantly different values?

hobojones 2227 days ago

Our failure rates were consistently much higher than Backblaze's, usually 25-30% higher. I never did any detailed analysis, but I'd chalk it up to a few factors. First, the stock of drives was not rotated out as frequently as I understand Backblaze's is. Second, the overall workload on the clusters was extremely heavy, but not consistent across all drives in the cluster, resulting in localized 'bursts' of activity. Finally, the datacenter where the clusters lived used evaporative cooling and was located in a hot, dry environment surrounded by very fine soil. During the summer the machine room was Georgia-level humid, and powering on any piece of hardware would usually yield a small cloud of dust.
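
For anyone wanting to make this kind of comparison, here is a minimal sketch of the annualized failure rate (AFR) arithmetic, using the drive-days convention that Backblaze describes in its published drive stats. The function names and every number in the example are hypothetical and chosen only so the arithmetic is easy to follow; they are not figures from either fleet.

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """AFR in percent: failures per drive-year of operation."""
    drive_years = drive_days / 365
    return 100.0 * failures / drive_years


def relative_increase(afr: float, baseline_afr: float) -> float:
    """How much higher `afr` is than `baseline_afr`, in percent of the baseline."""
    return 100.0 * (afr - baseline_afr) / baseline_afr


# Hypothetical fleet: 45,000 drives running for one year, 900 failures.
fleet_afr = annualized_failure_rate(failures=900, drive_days=45_000 * 365)

# Hypothetical baseline AFR of 1.6%, for illustration only.
print(f"fleet AFR: {fleet_afr:.2f}%")
print(f"relative to baseline: {relative_increase(fleet_afr, 1.6):.0f}% higher")
```

Note that "25-30% higher" here is a relative difference between two AFRs, not a difference of 25-30 percentage points.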

brianwski 2227 days ago

Disclaimer: I work at Backblaze.

> Our failure rates were consistently much higher compared to Backblaze

That's interesting.

> The overall workload on the clusters was extremely heavy

That is the most likely explanation. We see higher failure rates when we are writing to the drives, for example as a new vault fills with customer data. Backup (our oldest business) has a fairly easy workload; it is not as punishing as a database that is pummeling the drives.

In certain circumstances, when we are down more than 1 drive in a 20-drive Reed-Solomon group, we stop putting new data on that drive group until the parity is restored, explicitly because this lowers the chances of an additional drive failing in the group. That gives us more time to rebuild the parity with less stress in our lives. When that last parity drive fails and one more drive failure means customers lose data, the fun drains right out of this job. Red alerts are thrown, pagers go off in the middle of the night, and datacenter techs start driving towards the datacenter at 3am to replace drives. We prefer the world nice and calm and relaxed with a good night's sleep.
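
The write-pausing rule described above can be sketched roughly as follows. This is only an illustration of the policy as stated in the comment, not Backblaze's actual code: the `DriveGroup` class and its method names are hypothetical, and the parity count of 3 is an assumption, since the thread gives only the group size of 20.

```python
from dataclasses import dataclass


@dataclass
class DriveGroup:
    total_drives: int = 20   # group size given in the comment
    parity_drives: int = 3   # ASSUMPTION: not stated in the thread
    failed_drives: int = 0

    def accepts_new_writes(self) -> bool:
        # "Down more than 1 drive" means stop sending new data to this group.
        return self.failed_drives <= 1

    def spare_parity(self) -> int:
        # Further failures the group can absorb without losing data.
        return self.parity_drives - self.failed_drives

    def is_red_alert(self) -> bool:
        # No parity left: one more failure means customer data loss.
        return self.spare_parity() <= 0


group = DriveGroup(failed_drives=2)
print(group.accepts_new_writes())  # False: writes paused while parity is rebuilt
print(group.spare_parity())        # 1 under the assumed 3 parity drives
```

The point of pausing writes early is that the group stops taking write load while it still has parity to spare, rather than waiting until `is_red_alert()` would be true.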

bigiain 2227 days ago

> We prefer the world nice and calm and relaxed with a good night's sleep.

/me goes and checks Backblaze's careers page...

(I suspect driving time to any of their datacenters from Sydney puts me out of the running... At least their 3am emergencies would be a much more reasonable 8pm emergency from here.)

tinus_hn 2227 days ago

How fast can you restore such an enormous dataset anyway? Drives have become so big that a 20-drive group could hold almost 1/4 petabyte; it must take a long time to read the data, recalculate parity, and write it back.
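
A rough back-of-the-envelope sketch of that question: 1/4 petabyte spread over 20 drives is about 12.5 TB per drive. The estimate below assumes the rebuild of a single failed drive is limited by the sustained write speed of the replacement drive, and the 200 MB/s figure is an assumed placeholder, not a number from the thread. Real rebuilds also compete with production traffic, so treat the result as an optimistic lower bound.

```python
def rebuild_hours(drive_capacity_tb: float, throughput_mb_s: float) -> float:
    """Hours to fill one replacement drive at a constant sustained throughput."""
    total_bytes = drive_capacity_tb * 1e12            # decimal terabytes
    seconds = total_bytes / (throughput_mb_s * 1e6)   # decimal megabytes per second
    return seconds / 3600


group_capacity_tb = 250.0                 # "almost 1/4 petabyte"
per_drive_tb = group_capacity_tb / 20     # 12.5 TB per drive

# ASSUMPTION: 200 MB/s sustained write speed on the replacement drive.
print(f"{rebuild_hours(per_drive_tb, 200):.1f} hours per drive")
```

Under these assumptions a single full drive takes on the order of 17 hours to rewrite, which is why keeping extra load off a degraded group matters.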

brandmeyer 2227 days ago

Turns out that some outdoor environmental dust is ferromagnetic.

A particular vertical-axis wind turbine project was destroyed by buildup of magnetic dust on the generator magnets.

I'm not saying that's what was killing your drives directly. Normally HDDs are fully sealed. But the amount of dust you mention is awfully suspicious. It might change your investigative perspective a bit when you consider that some small percentage of all that dust is ferromagnetic.
