by closeparen 2146 days ago
>hardware monitoring is significantly improved to the point where you’ll typically know if something will fail and can schedule the maintenance.

There's SMART for disks... what else?

2 comments

And multiple power supplies. I have been running a single physical server like this for ~10 years, and the only downtime was me rebooting into a new kernel and the time people at the datacenter messed up BGP routing (their fault). Hardware is really very reliable now, especially in a datacenter environment. But still not 100%, of course. There is still some probability of failure, though it is lower than most people think. The ICs themselves most likely won't break; it's mostly capacitors that degrade over time, and the flash chips holding the BIOS are normally guaranteed for only about 10 years. A BIOS upgrade (which rewrites the flash) would extend that, though. I had one disk fail in the RAID array and swapped the drive without any downtime.
ECC for RAM is the other big one. A single-bit error will trigger warnings, so that you can replace the faulty DIMM before it progresses into uncorrectable errors.
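For anyone who wants to watch those warnings themselves: on Linux with an EDAC driver loaded, the kernel exposes per-memory-controller corrected and uncorrected error counters under /sys/devices/system/edac/mc/. Here is a rough sketch of polling them. The function names and the threshold of 1 are my own assumptions for illustration, and the files only exist if your platform's EDAC driver is loaded.

```python
from pathlib import Path

# Default Linux EDAC sysfs location; only present when an EDAC driver is loaded.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller_name: (corrected, uncorrected)} for each mcN directory."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def controllers_needing_attention(counts, ce_threshold=1):
    """List controllers with any uncorrected error, or corrected errors at/above the threshold.

    ce_threshold is a hypothetical policy knob, not something the kernel defines.
    """
    return [
        name
        for name, (ce, ue) in counts.items()
        if ue > 0 or ce >= ce_threshold
    ]


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found (driver not loaded?)")
    for name in controllers_needing_attention(counts):
        ce, ue = counts[name]
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Run it from cron or a monitoring agent and alert when the corrected count starts climbing. That is the "schedule the maintenance" signal.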
Is there a tool that can randomly take 128 MB chunks of memory out of the pool and test them around the clock?