by papercruncher 4911 days ago
I dealt with this exact problem for a number of years. Background scrubbing takes away I/O resources and can be a disaster for your workload if you rely on sequential reads/writes. For that reason, most controllers are configured by default to scrub only when the disk is totally idle, which is never. Even if the controller had a better definition of idle, scrubbing an entire disk to find those rotten bits would take a very long time; a disk would almost certainly fail before the scrub finished.
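To put a rough number on "a very long time", here is a back-of-the-envelope sketch. The capacity, read rate, and idle fraction are made-up illustrative values, not measurements from any real controller, and the function name is just something I picked.

```python
def scrub_days(capacity_bytes: float, read_bytes_per_s: float, idle_fraction: float) -> float:
    """Days needed to read the whole disk once if scrubbing only runs while the disk is idle.

    idle_fraction is the share of wall-clock time the disk counts as idle (0 < f <= 1).
    All inputs are hypothetical; plug in your own numbers.
    """
    if not 0 < idle_fraction <= 1:
        raise ValueError("idle_fraction must be in (0, 1]")
    effective_rate = read_bytes_per_s * idle_fraction  # average bytes/s actually scrubbed
    return capacity_bytes / effective_rate / 86_400    # 86,400 seconds per day


# Hypothetical 2 TB disk that can read 100 MB/s sequentially:
full_speed = scrub_days(2e12, 100e6, 1.0)    # a few hours if the disk does nothing else
mostly_busy = scrub_days(2e12, 100e6, 0.01)  # weeks if it is idle only 1% of the time
print(f"{full_speed:.2f} days vs {mostly_busy:.1f} days")
```

The point is only that the scrub time scales inversely with how often the disk is considered idle, so a strict definition of idle stretches a job of hours into weeks or worse.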
2 comments

I use the built-in SMART full disk check. It's quite good at only reading when the disk is idle, and it checks the entire disk.

A quick self test every day for all disks, and a long (i.e. full read) self test once a week.

The RAID is then checked on top of that once a month (although that slows things down a bit).
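For anyone who wants to set up the daily/weekly self-test part, one way is smartd from smartmontools, which takes a `-s` schedule regex of the form `T/MM/DD/d/HH` in smartd.conf. The sketch below just builds such a line; the helper name, device path, and chosen hours are assumptions for illustration, and the monthly RAID check is not covered because it depends on the RAID implementation.

```python
def smartd_line(device: str, short_hour: int, long_weekday: int, long_hour: int) -> str:
    """Build a smartd.conf line: short self-test every day, long self-test once a week.

    long_weekday follows smartd's convention: 1 = Monday ... 7 = Sunday.
    Hours are 00-23 and name the hour in which the test starts.
    """
    if not (0 <= short_hour <= 23 and 0 <= long_hour <= 23):
        raise ValueError("hours must be in 0..23")
    if not 1 <= long_weekday <= 7:
        raise ValueError("long_weekday must be in 1..7")
    short = f"S/../.././{short_hour:02d}"               # any month, any day, any weekday
    long_ = f"L/../../{long_weekday}/{long_hour:02d}"   # one weekday per week
    return f"{device} -a -s ({short}|{long_})"


# Hypothetical device: short test daily in the 02:00 hour, long test Saturdays in the 03:00 hour.
print(smartd_line("/dev/sda", 2, 6, 3))
```

You would add one such line per disk to smartd.conf and restart smartd. Check the smartd.conf man page for your version before relying on it.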

With sufficient redundancy available, could you temporarily take a drive out of the RAID for scrubbing, and then add it back in when you're done, to avoid conflicting with ongoing work and destroying linear access patterns?
The rebuild would be worse than the scrubbing.

A better plan is to light up your disaster recovery plan weekly, and while the DR system is handling the load, scrub to your heart's content on the down system.

Depending on the cost of your hardware vs. the cost of your labor vs. the cost of downtime, running dual servers, one flagged as production and one as development, and alternating the flags every weekend might work out. You'll hear lots of bragging about that not being possible because the hardware is too expensive, but not so much bragging about labor cost and downtime cost. I worked at a financial services corp about two decades ago where downtime supposedly cost in excess of $1M/hr. They had triple mainframes set up, basically three machine rooms inside the machine room.