by starman100 2564 days ago
That's our impression, too. There are AMD motherboards that take ECC memory, but we've never seen them act on uncorrectable ECC errors (the correctable errors are handled, but uncorrectable errors aren't reported!).

We will only use Intel Xeon for our work because of this. You'll get about 1 bit flip/GB/year. With 128 GB or more in our standard builds, this would be more than 2/week. We just can't have that uncertainty in the data we provide.
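
A quick sanity check of that arithmetic, as a minimal sketch. The 1 flip/GB/year rate and the 128 GB capacity are the figures quoted above; the function name is made up for illustration, and the calculation assumes flips scale linearly with installed memory.

```python
WEEKS_PER_YEAR = 365.25 / 7  # about 52.18

def expected_flips_per_week(capacity_gb, flips_per_gb_year=1.0):
    """Expected bit flips per week for a machine with the given RAM size.

    Assumes a constant per-GB rate (the figure quoted in the comment).
    """
    return capacity_gb * flips_per_gb_year / WEEKS_PER_YEAR

rate = expected_flips_per_week(128)
print(f"{rate:.2f} flips/week")  # a little under 2.5
```

So "more than 2/week" holds for a 128 GB build if the quoted rate is taken at face value.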

And while Cinebench is a useful benchmark, all our heavy number crunching is done on NVIDIA 2080-class hardware, so the fact that AMD may have an advantage in some cases isn't that interesting for us. Perhaps if you're a gamer, who doesn't care about an occasional bitflip, looking to squeeze the last drop of value out for his dollar....


If you want to compare to Xeons, the right comparison isn't the consumer CPUs this is about; the AMD equivalent is the Epyc line.

AMD doesn't disable ECC support entirely on consumer CPUs like Intel does, but as far as I know it's also not officially supported or guaranteed to work; it's up to the mainboard vendor how to handle this. In the Intel case you simply can't get ECC with non-Xeon CPUs.

Well, even on Intel, it's possible the firmware/OS isn't doing the right thing. This was pretty common ~10 years ago, when the default Linux behavior wasn't to report soft errors in the logs (due to missing drivers/whatever), so a lot of machines might just sit there correcting errors, and the only way to find out was to turn up some BMC/etc. logging. I guess that is why you should buy machines from HP/Dell/Lenovo that are fully certified for your OS rather than from random whitebox manufacturers, although given the problems I was having with HP equipment at the time, that's questionable.
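
For anyone wanting to check what their own box reports: on Linux, when an EDAC driver for the memory controller is loaded, corrected and uncorrected error counters are exposed under `/sys/devices/system/edac/mc/mc*/` as `ce_count` and `ue_count`. Below is a minimal sketch that reads them. The function name and the `root` parameter are hypothetical (the parameter is only there so it can be pointed at a test directory), and an empty result most likely means no EDAC driver is loaded, not that there were no errors.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_edac_counts(root=EDAC_ROOT):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts  # no EDAC sysfs tree on this machine
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed counter; skip this controller
        counts[mc.name] = (ce, ue)
    return counts

counts = read_edac_counts()
if not counts:
    print("no EDAC memory controllers found (driver not loaded?)")
for name, (ce, ue) in counts.items():
    print(f"{name}: corrected={ce} uncorrected={ue}")
```

All-zero counters with a driver loaded are a different situation from no counters at all, which is the distinction the comment above is getting at.
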
> can't get ECC with non-Xeon CPUs

Not quite. There are some Core branded CPUs that support ECC, including funnily enough the i3's.

> There are some Core branded CPUs that support ECC, including funnily enough the i3's.

Hell, there are Celeron and Pentium chips that they have it enabled on. Not because they expect desktop users to buy them, but because it allows them to keep their Xeon brand premium while letting OEM's like Dell advertise the T140 "starting at $549" (in a configuration nobody would ever want to buy).

It depends on the use. For a home server or even a small office fileserver, you don't need massive threading capability, and in fact some of those low-core-count parts are fairly highly clocked, which makes them faster.

For example in the 7000 series, the i3 7100 has a 3.9 GHz base clock and you have to go almost to the top Xeon (the equivalent of an i7) to get anything equivalent. And even then it's a turbo, not a base clock, so in principle the motherboard should not let you turbo forever (PL2 time limit may actually be enforced on a server chipset).

Also depending on workload you may not even be able to exploit an increased threading capability anyway, without 10 GbE on the box, or link aggregation capability.

Oh, the i3's are fine for a general small business workload - compared to the socket-compatible Xeon's all they're really lacking is extra PCIe lanes if you need them. That's ultimately what Intel uses to segregate the Xeon and HEDT chips from their mainstream platform, after all.

The Celeron and Pentium chips that have infiltrated entry-level servers are absolute trash though.

There are even Atom chips with ECC: https://ark.intel.com/content/www/us/en/ark/products/97935/i... (for I assume the NAS products that use these?)

It seems like it's mostly any chip that would compete with the Xeon-W gets ECC removed.

There are no Epyc workstation chips with clock speeds comparable to Threadripper/Xeon-W, at least among the currently released products. Thus I consider Threadripper the closest competition to Xeon-W, not Epyc. AMD also lists ECC memory support as a feature for Threadripper.

That said, the new Xeon-W series has more memory channels (6) and supports more RAM (up to 2 TB) than any existing Threadripper product. I.e., AMD doesn't have an equivalent product for all use cases yet.

However, we don't know the Zen 2 Threadripper lineup and the frequencies for the different Zen 2 Epyc SKUs are also not public yet. AMD could release Threadripper with support for RDIMM/LRDIMM or Epyc chips with higher clock speeds to better compete against Xeon-W.

Supermicro lists tested ECC memory for their AMD mainboards. It would be very strange if that did not work.
The "average" in no way reflects the reality of any given machine. I've been running ECC RAM in my NAS boxes at home for 20+ years (I put ECC on an AMD K6-II). Not once have I seen any of those machines report correctable errors (outside of the error-injection testing I usually perform before putting them in service). Similarly, at work I've had the opportunity to pull BMC/etc. logs from a lot of machines over the past decade or so. The vast majority of machines never report any errors. Very rarely a machine will crop up that reports a soft error on some longer cycle (say every 3-5 weeks). At roughly the same rate there are machines that have obviously failed in some way. They go from functional to hard errors pretty much overnight, generally with less than a couple of days of warning during which soft errors were being corrected.

Both cases are hardware errors of some form, because usually swapping RAM/motherboard/power supply/etc. will clear it up.

You've _never_ seen a correctable error reported in 20 years? I think your AMD motherboard isn't handling these errors right. That's much more likely than you never having had one in 20 years.

See Google's study: https://static.googleusercontent.com/media/research.google.c...
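
To put a rough number on "much more likely", here is a sketch under a simple Poisson model: flips are independent and arrive at a constant per-GB rate. That model is a simplification, not something the thread establishes. The 4 GB average capacity is purely hypothetical (the home NAS sizes aren't stated), and the 1 flip/GB/year rate is the figure quoted at the top of the thread.

```python
import math

def prob_no_flips(capacity_gb, years, flips_per_gb_year=1.0):
    """P(zero flips) under a Poisson model with a constant per-GB rate."""
    expected = capacity_gb * years * flips_per_gb_year
    return math.exp(-expected)

# Hypothetical home NAS: 4 GB averaged over 20 years -> 80 expected flips.
p = prob_no_flips(capacity_gb=4, years=20)
print(f"P(no flips) = {p:.1e}")  # on the order of 1e-35
```

Under those assumptions, never seeing an error would be essentially impossible, so either the real rate on that hardware is far lower than the quoted average or errors aren't being reported. The model can't tell which.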

Not on the chain of hardware I run at home (the machines with ECC are ones I spec and configure very conservatively), on other larger collections of machines, sure...

I've seen Google's study, and out of the few thousand or so machines I've collected statistics from, the few machines with soft errors were fixable and stopped reporting soft errors after having something swapped.

The Google study itself goes on and on about the variability of errors, with such wonderful sections as "These numbers vary greatly by platform. Around 20% of DIMMs in Platform A and B are affected by correctable errors per year, compared to less than 4% of DIMMs in Platform C and D."

The paper really leaves a lot of holes. I don't remember (nor do I see after skimming it) any note of how aggressively they are running the RAM. Did they, say, try to relax the RAM timings/bump voltage on the platforms they were having issues with? Did they compare how mature the technology was when they commissioned it? Did they try to diagnose the machines reporting high error rates by seeing if they could convert a machine with a high error rate to something lower? They do spend a lot of time talking about temperature, though. The only valid conclusion I think can be drawn from the paper is "ECC is important; use it, because you will have RAM failures, and it's better to know about them than not".

To me the paper speaks to Google's diagnostic/repair system more than anything. I took a proactive approach and replaced DIMMs/motherboards/power supplies/etc. that reported correctable errors. When we were self-supporting we would swap the questionable parts into other machines to see if the failures would follow them, in an attempt to prove a failing part was marginal, then return/exchange it if it failed in more than one machine.

I've seen a lot of different failures over time, and when I was partially in charge of designing/picking platforms I even managed to find actual design bugs a couple of times that caused low error rates (not in the RAM subsystem, thankfully). I tended to use the "any kind of failure when run normally is instant disqualification" metric when I was initially picking new platforms before buying them to put in production. I would never have qualified a platform that had a 20% DIMM failure rate. (Well, at least not purposefully; we got some stinkers, but we tried to correct our mistakes.)

Given what I've heard of Google, I'm not sure I would really extend these reliability metrics unless you're buying the latest bleeding-edge parts and running them well into their design margins. These days it's pretty common to design systems that have error correction and push the physical topology to the point where there is an expectation of a pretty solid error rate (think SSD flash chips). So for a company like Google, pushing the RAM timings/etc. right out to the margin where they are experiencing a low but statistically unlikely error rate would seem to be the right thing to do. It's different if you're a bank/etc. running financial data. In that case you buy for reliability first.

> There are AMD Motherboards that take ECC memory, but we've never seen them act on ECC errors that were uncorrectable (the correctably errors are handled, but uncorrectable errors aren't reported!).

That's incorrect. Uncorrectable errors are properly reported to the OS. Wendell from Level One Techs has tested this: https://www.reddit.com/r/Amd/comments/b1qmgy/ars_technica_th...

Yeah, the new Xeon E-21XX's are proving hard to find in the consumer market. Most places are back-ordered or just place orders directly with Intel after you buy. The scalpers are charging $50 or more over retail on eBay and through 3rd-party Newegg/Amazon sellers.
I don't understand. At that rate your chance of having a second bit flip before the first is fixed is almost zero, and the chance of hitting two separate bit flips in the same row is ridiculously small.

If you're worried about a single event causing two flips in a single row... I suppose that's possible, but it could also cause three bit flips. So a Xeon has a non-zero error rate. Is Ryzen meaningfully worse?

I think his argument is that you will get bit flips; ECC is just going to report and/or correct them. Without it, you're hoping the bit flips show up somewhere you can detect them (application crash/etc.) rather than silently chugging along and ruining your results/data/whatever.

I had the chance a long time ago to work on a product that as a side effect was corrupting system memory... Think of it as a kernel module that picks a random number between 0 and MAX_RAM and flips a byte. It's truly amazing how many of those can happen before there is any visible evidence something is wrong.

You're talking about ECC vs. no ECC. That's not what the comment was saying; it was saying the board handled single bit flips correctly but not double bit flips. But at 1 bit flip per GB per year, randomly distributed, you are guaranteed many single bit flips, while a double bit flip is almost never going to occur.
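
A back-of-the-envelope estimate for "almost never", as a sketch under stated assumptions: 128 GB of RAM protected in 64-bit (8-byte) ECC words, flips landing uniformly and independently, and, pessimistically, no flip being scrubbed or overwritten for a whole year. The function name is made up, and the birthday-problem approximation ignores details such as two flips hitting the very same bit.

```python
import math

def prob_double_flip(capacity_gb, flips, word_bytes=8):
    """Approximate P(at least two of `flips` land in the same ECC word).

    Birthday-problem approximation: 1 - exp(-n(n-1) / (2 * words)).
    """
    words = capacity_gb * 2**30 // word_bytes
    pairs = flips * (flips - 1) / 2
    return -math.expm1(-pairs / words)

# 128 GB at 1 flip/GB/year -> 128 flips spread over 2**34 words.
p = prob_double_flip(capacity_gb=128, flips=128)
print(f"P(double flip in one word per year) ~ {p:.1e}")
```

That comes out to a few in ten million per machine-year for independent flips. It says nothing about correlated failures, where a single event or a failing DIMM corrupts several bits at once, which is the case the earlier question about a single event causing two flips in one row was raising.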