| This reminds me of testing I did years ago on ... CD-ROMs. Funny how lessons from old technology can apply to new technology. Around 15 years ago my company did a Linux distribution on CDs: KRUD. It was updated monthly, and we had something like 400 subscribers. For various reasons we burned these CDs in house on a cluster I built. We would burn, eject, read and checksum, and if the read test succeeded we would ship the disc out. We found that some users had problems reading some of the discs. We contacted these users, paid them to return the CDs, and did further testing on the discs. Our initial test used dd, and we found that the discs that were not obviously damaged in shipping would tend to pass on some of our CD-ROM drives but fail on others. Even when they did pass, the reads would tend to take longer than normal. I wrote a new test program that, instead of using dd, issued SCSI read commands directly and timed every one. It would then count the number of reads that were "slow" (like 2x normal) and those that were "really slow" (like 5x), and if either count got over a certain threshold we would throw away the disc. Being able to time the raw operations was incredibly useful, and it seems like it could have shown the authors of this paper their problems before they deployed to production. Then again, they didn't really seem to do very thorough testing of the drives. Stress testing a 1TB drive for an hour seems pretty short. At that same company we also did hosting. We found that if we burned in disks by reading/writing to them 10 times ("badblocks -svw -p 10"), we would almost never experience drive failures on the Hitachi drives we were using. If we didn't do this, the drives had a fairly high chance of falling out of the RAID array in production. As drive sizes increased from 20GB to 200GB to 1TB, these tests started taking weeks to complete. But they were totally worth it. |
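
For anyone who wants to try the timed-read idea, here is a rough Python sketch of it. It is not the original program: that one issued SCSI read commands, while this one just times ordinary `os.read` calls on a file or device path. The function names, the 64 KiB block size, the use of the median as "normal", and the pass/fail limits are all assumptions made up for illustration; only the 2x and 5x factors come from the description above.

```python
import os
import statistics
import time


def time_reads(path, block_size=64 * 1024):
    """Read `path` sequentially in fixed-size blocks; return seconds per read."""
    timings = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            start = time.perf_counter()
            data = os.read(fd, block_size)
            elapsed = time.perf_counter() - start
            if not data:  # end of file / end of device
                break
            timings.append(elapsed)
    finally:
        os.close(fd)
    return timings


def classify_timings(timings, slow_factor=2.0, really_slow_factor=5.0):
    """Return (slow, really_slow) counts relative to the median read time.

    Assumption: the median read time stands in for "normal".
    A read counted as "really slow" is not also counted as "slow".
    """
    if not timings:
        return 0, 0
    normal = statistics.median(timings)
    really_slow = sum(1 for t in timings if t >= really_slow_factor * normal)
    slow = sum(1 for t in timings if t >= slow_factor * normal) - really_slow
    return slow, really_slow


def disc_passes(timings, max_slow=20, max_really_slow=2):
    """Hypothetical limits: reject the disc if either count exceeds its maximum."""
    slow, really_slow = classify_timings(timings)
    return slow <= max_slow and really_slow <= max_really_slow
```

Something like `disc_passes(time_reads("/dev/sr0"))` would be the obvious way to call it, with the device path being whatever your drive is. Note that plain reads like these can be served from the OS page cache, so a real test would need to make sure each read actually hits the media.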