by dap 4950 days ago
Great post, showing admirable dedication to software reliability and a solid understanding of memory issues.

One of the suggestions was that the kernel could do more. Solaris-based systems (illumos, SmartOS, OmniOS, etc.) detect both correctable and uncorrectable memory errors. An error may still cause a process to crash, but the system also raises a fault to notify administrators of what happened, so you don't have to guess whether you experienced a DIMM failure. After such errors, the OS removes the faulty pages from service. Of course, none of this has any performance impact until an error occurs, and then the impact is pretty minimal.

There's a fuller explanation here: https://blogs.oracle.com/relling/entry/analysis_of_memory_pa...

3 comments

I don't think enough people appreciate just how awesome an OS Solaris was. I never had the opportunity to deploy it full-scale for any projects, but I lamented the loss of great potential when it "died."
It didn't die. It was forked by the community when Oracle close-sourced it. The community fork (called illumos) is being actively developed by multiple companies, which have done significant new feature work (e.g., http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/).
The first and only time I used Solaris, I tried to run our application and got the error "System out of colors" or some such. Swore then and there never to use it again if I could help it.
Linux kernels also detect and report memory errors (and have done so for years), just FYI, through either MCE or EDAC. Not really that special.

Pro-tip: use ECC memory on servers. The end.
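
For anyone who wants to check the EDAC side mentioned above, here is a minimal sketch in Python that reads the per-controller error counters Linux exposes under sysfs. It assumes the EDAC driver for your memory controller is loaded and that the counters live at /sys/devices/system/edac/mc/mc*/ce_count and ue_count; the function name read_edac_counts and its root parameter are made up for illustration. Treat it as a starting point, not a monitoring tool.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} for each mc* directory.

    "ce" is the correctable-error count and "ue" the uncorrectable-error count.
    The root parameter is a hypothetical knob so the function can be pointed
    at a test directory instead of the real sysfs tree.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC directory: driver not loaded, or the platform has no ECC support.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable; record that rather than guessing.
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: correctable={c['ce']} uncorrectable={c['ue']}")
```

A nonzero and growing correctable count on one controller is the usual hint that a DIMM is on its way out. If the function returns an empty dict, that only means the sysfs tree wasn't found, not that the memory is healthy.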

Thank you for the interesting link, dap.
I take it you know about /var/log/mcelog?