Hacker News
by mjb 5346 days ago
> ECC checking memory

There is a balance, and I don't feel that throwing out ECC memory is necessarily the right choice for the majority of server applications. Low hardware cost needs to be achieved with a holistic approach: simply buying the cheapest possible components is unlikely to lead to the lowest cost unless your software and datacenter designs are really specialized for it.

DRAM errors are rather common in real systems[1]. There are two big hidden costs to this. The first one is the risk of silent data corruption. Unless you are willing to write your software in a way that is very careful to check all calculations, you run the risk of getting the wrong answer. The other hidden cost is operational: memory errors are often difficult to diagnose and you have to pay a highly skilled human to do it, as well as lose the use of the server while it is being done.

It may be that buying ECC RAM decreases the cost per page reliably served across your entire operation. If you are at Google scale then that may not be the case, but for nearly all smaller operations it is.
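
As a rough back-of-envelope sketch of that trade-off: the function below folds hardware price and failure handling into a single cost per million pages served. The function name, the parameters, and every number are made up for illustration (none of them come from the paper cited above), and `cost_per_failure` is assumed to lump together diagnosis labor and lost server time.

```python
def cost_per_million_pages(hardware_cost, failures_per_year, cost_per_failure,
                           pages_per_year, years=3.0):
    """Lifetime cost of one server per million pages served.

    All inputs are hypothetical placeholders, not measured values.
    cost_per_failure is assumed to include both the human time spent
    diagnosing the fault and the value of the server's lost uptime.
    """
    total_cost = hardware_cost + failures_per_year * cost_per_failure * years
    total_pages = pages_per_year * years
    return total_cost / total_pages * 1_000_000


# Invented numbers: ECC costs more up front but is assumed to cause
# far fewer memory-related incidents per year.
non_ecc = cost_per_million_pages(hardware_cost=2000.0, failures_per_year=0.5,
                                 cost_per_failure=400.0, pages_per_year=1e9)
ecc = cost_per_million_pages(hardware_cost=2200.0, failures_per_year=0.05,
                             cost_per_failure=400.0, pages_per_year=1e9)

print(f"non-ECC: ${non_ecc:.4f} per million pages")
print(f"ECC:     ${ecc:.4f} per million pages")
```

Whether ECC comes out ahead depends entirely on the inputs you plug in. With very cheap, automated failure handling the up-front premium may not pay for itself; with expensive human diagnosis it probably does.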

'Enterprise' type hard drives are another case where spending more up front can save money in the long term. Having a human replace a disk, and having the server down for the time it takes to replace it, is expensive. If you have a large number of disks, especially if you are sensitive to small numbers of IO errors, it may be worth paying more up front.

Using an external view of Google's architecture to say 'cheap hardware is always good' is too simplistic. Yes, there is good evidence that single-host reliability mechanisms like RAID might give a poor ROI. Yes, redundancy is a powerful way to get reliability. But, before you take this to the extreme, you need to have carefully designed applications, carefully designed datacenters, and extremely low per-host operations costs (probably through aggressive automation). Unless you have these things, the optimal cost-per-request server design for your company may be very different from the ones Google, Facebook and Amazon chose.

[1] http://www.cs.toronto.edu/%7Ebianca/papers/sigmetrics09.pdf

1 comment

Indeed. And even at Google scale, where 'good enough' is an art form, ECC memory is deemed to be worth the money. The paper you cited does its study on Google hardware, and I can confirm that ECC memory is still used there today.