| HN Mirror

	by spystath 1018 days ago
	More RAM is always nice, but I'm secretly hoping we'll start to see more ECC support in the future. With these humongous modules, even a teeny tiny per-bit flip probability makes the chance of corruption non-negligible.
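
To make the scaling argument concrete, here is a minimal sketch that models bit flips as a Poisson process, so the chance of at least one flip grows with capacity and uptime. The rate constant is a made-up placeholder rather than a measured figure, and the function name is purely illustrative.

```python
import math


def p_any_flip(capacity_gib: float, hours: float, flips_per_gib_hour: float) -> float:
    """Probability of at least one bit flip, assuming independent (Poisson) flips."""
    expected_flips = capacity_gib * hours * flips_per_gib_hour
    # 1 - exp(-x), computed via expm1 so tiny probabilities keep their precision.
    return -math.expm1(-expected_flips)


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 1e-6  # flips per GiB per hour; placeholder, NOT a measurement
    for capacity in (8, 64, 1024):
        p = p_any_flip(capacity, hours=24 * 365, flips_per_gib_hour=HYPOTHETICAL_RATE)
        print(f"{capacity:5d} GiB for one year: P(at least one flip) = {p:.4f}")
```

Whatever the real rate is, the expected number of flips is linear in capacity, which is the point being made above.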

8 comments

Aurornis 1018 days ago

These are individual memory chips. They can be used to build both ECC modules and non-ECC modules.

ECC modules just have more chips to store the extra parity information. In the high capacity RDIMM server market there are plenty of ECC options.

spystath 1018 days ago

Oh yes, I understand that. I only wish that ECC support in general would start getting more traction in consumer electronics. Nowadays (unless you go for super noisy, super expensive server hardware), maybe with an AMD processor, maybe a motherboard manufacturer will have a 20-links-deep document that says that these ECC modules may be supported, proceed at your own risk, might set your flat on fire, kill kittens, etc. When you had a couple of gigs of RAM it was probably irrelevant, but if you have multiple TB of RAM caching file access, ECC should become normalised.

Aurornis 1018 days ago

ECC isn’t that hard to get in consumer platforms now. The situation has changed a lot from what you’re thinking.

You can get ECC support on Intel 12th and 13th generation parts by buying a motherboard with a W680 chipset.

You can get ECC support on modern AMD CPUs by picking a motherboard that lists ECC support on the product page. It’s not that hard.

delfinom 1018 days ago

Yea, the biggest downfall of ECC in computers was Intel intentionally disabling ECC in dies used for the consumer processors and leaving it only for Xeons, as a way of forcefully keeping the market segregated.

AMD otoh has brought ECC to the table in Ryzens without the same shenanigans

rini17 1018 days ago

And you can get a reasonably priced notebook with ECC by...?

Aurornis 1018 days ago

Lenovo has them. Again, not hard if you look.

I know some people won’t be happy until every laptop has ECC RAM and is super cheap, but the reality is that the demand for ECC RAM is very low. The majority of users would choose the extra battery life and lower price if given the option.

rini17 1018 days ago

I looked and it's hard. Had to resort to reddit recommendations.

Nice circular reasoning. But nothing will change until we're vocal enough about ECC benefits and shady pricing. I assure you though, it's not about my happiness :)

RecycledEle 1018 days ago

I love ECC RAM, but I disagree on one small point.

Registered (meaning ECC and buffered) RAM is common in the workstation market, so it is not limited to noisy servers.

Check out HP Z series and Dell Precision workstations. They are available used / refurbished at low prices.

andromeduck 1017 days ago

Do you think Apple will reintroduce it in the Mac Pros?

MichaelZuo 1018 days ago

There are lots of off-the-shelf laptops available with ECC memory, some even in slim form factors. For desktops, the entire ThinkStation lineup has ECC available as an option or as standard.

For the higher-priced models you can't even order them with non-ECC memory.

hedora 1018 days ago

I think inline ECC (the module performs the ECC) is mandatory with LPDDR4 (the error rates on current silicon are too high to leave it out), but link ECC (between the CPU and the module) is optional.

Note that link ECC + inline ECC don't give you end-to-end protection, since the controller in the memory module can still flip bits. DDR5 is moving to on-die ECC, which (unlike DDR <= 4's side-band ECC) also isn't end-to-end.

I'd like to see side-band ECC continue to exist, but I think it is going to be phased out entirely.

This article defines all the terms, but is very vague about what things are mandatory, or how reliable the error correction schemes are. For instance, it carefully doesn't say that SECDED schemes detect all two-bit errors; instead it says they detect at least some:

https://www.synopsys.com/designware-ip/technical-bulletin/er...

toast0 1018 days ago

> I'd like to see side-band ECC continue to exist, but I think it is going to be phased out entirely.

I doubt it will be phased out for servers. I haven't seen anyone reporting that on-die ECC in DDR5 has a reporting mechanism, and reporting on RAM errors is important for server reliability.

bpye 1018 days ago

I really wish we’d just get in-band ECC on normal consumer platforms. That way we’d need no special DIMMs: in applications where ECC was desired it could be enabled and the capacity penalty would be paid; in other applications it could be disabled and no capacity would be lost.

gabereiser 1018 days ago

I like this idea. 64 GB of RAM non-ECC, 48 GB in ECC. Dynamic, succinct, and enables more supply chain crossover by not having two (three?) separate DIMM types.
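
As a rough sketch of the capacity penalty: with in-band ECC the check bits live in ordinary memory, so usable capacity shrinks by the code's overhead. The 64:8 ratio below is the classic side-band SECDED layout; whether any given in-band implementation uses that ratio is an assumption here, and the function name is illustrative.

```python
def usable_capacity_gb(raw_gb: float, data_bits: int, check_bits: int) -> float:
    """Capacity left for data when check bits are stored in the same memory."""
    return raw_gb * data_bits / (data_bits + check_bits)


if __name__ == "__main__":
    # 8 check bits per 64 data bits (assumed ratio, borrowed from side-band ECC).
    print(f"{usable_capacity_gb(64, 64, 8):.1f} GB usable out of 64 GB")
    # The 64 GB -> 48 GB figure above corresponds to 1 check bit per 3 data bits.
    print(f"{usable_capacity_gb(64, 3, 1):.1f} GB usable out of 64 GB")
```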

jakobson14 1018 days ago

Talk to Intel.

On AMD, ECC support is pretty much standard on every chip they make, and always has been. Even my shitty 4-core Phenom from over ten years ago on an el-cheapo motherboard supported swapping its regular DIMMs for ECC ones. You're never going to get ECC "for free" but it would be totally possible for everyone to pay the cost once and just move to ECC-only for everything from now on.

Except Intel, the company that brought software-locked hardware features to x86, loves to price-differentiate.

Aurornis 1018 days ago

Having physical memory segments be different logical sizes at runtime depending on the ECC setting does not sound fun.

Having your system’s available memory fluctuate up and down based on how many segments are currently set to ECC also doesn’t sound fun.

Having developers manually turn ECC off for regions where it’s unimportant sounds like a lot of complexity for a relatively rare use case.

There is in-band ECC in some newer Intel designs, but it’s all or nothing. Adding extreme complexity to memory management to selectively disable it sounds like a lot to ask.

gabereiser 1018 days ago

I believe it would just be a kernel setting. Developers would just see full capacity or ecc-capacity, they wouldn’t care much why.

Aurornis 1018 days ago

It’s implemented as a BIOS setting where it’s supported.

But the parent comment was suggesting that it be on or off depending on the memory segment, which is a completely different problem.

saltcured 1018 days ago

I don't see "segment" in the earlier post at all.

I think your reading depends on thinking "application" means "process", while another reading would be that an application is a particular deployed system, where this setting can be altered e.g. at the BIOS level.

bpye 1018 days ago

Sorry yes I did mean application as in a deployed system rather than a specific process.

gabereiser 1018 days ago

Likewise, I assumed it was a hard system-wide setting and not application-specific.

hinkley 1018 days ago

Doesn’t DDR5 require ECC to function properly? I think we’ve gotten to the point that we need extended error correction as a mark of robustness. E2C2.

fweimer 1018 days ago

It does, but this particular implementation is local to the module, and cannot be used for secondary purposes in addition to error correction, such as storing tag bits.

fbdab103 1018 days ago

As a consumer does that matter? I understand server grade hardware wants the extra monitoring/diagnostic gizmos, but will the memory be corrected with the same efficiency as DDR4 ECC or is it an entirely neutered implementation?

ElectricalUnion 1018 days ago

Not entirely neutered, but at least "data-at-rest" is protected, while "data-in-flight" is not.

undersuit 1018 days ago

I'm not sold on on-die DDR5 ECC providing protection.

On-die ECC allowed DDR5 to be competitive with DDR4. Is it really protecting your data at rest if the DDR5 die is running at such tolerances that it's correcting single-bit errors from internal signalling issues every transaction? It's only single-bit ECC: if something outside of the die (cosmic ray, sudden voltage change, sudden temperature change) induces a bit to flip while the internal circuitry causes a different bit to flip, your data is now corrupt.
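
The two-flip failure mode can be illustrated with a toy single-error-correcting code. The sketch below uses plain Hamming(7,4) with no extra detection bit; it is a stand-in for the idea only, not the code DDR5 dies actually use, and `sec_encode` and `sec_decode` are illustrative names.

```python
from itertools import combinations


def sec_encode(nibble: int) -> list[int]:
    """Hamming(7,4): positions 1..7, check bits at 1, 2 and 4 (index 0 unused)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code


def sec_decode(received: list[int]) -> int:
    """Blindly 'correct' whatever bit the syndrome points at, then extract data."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1  # right for one flip, wrong for two
    return code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)


if __name__ == "__main__":
    original = 0b1011
    silent = 0
    for a, b in combinations(range(1, 8), 2):
        word = sec_encode(original)
        word[a] ^= 1
        word[b] ^= 1
        if sec_decode(word) != original:
            silent += 1
    print(f"{silent} of 21 double flips decoded to wrong data with no error raised")
```

With one flip the decoder recovers the data. With two, the syndrome points at a third, innocent bit, and the decoder hands back wrong data without any indication that something went wrong.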

https://www.atpinc.com/tw/blog/ddr5-what-is-on-die-ecc-how-i...

fbdab103 1018 days ago

Is there any intuition about how frequently data-at-rest errors occur vs data-in-flight? Would the native DDR5 ECC get me 90% of the way there or is it so minor as to be effectively meaningless?

I assume it is going to take another decade to fully unwind Intel's ECC market segmentation. I'm trying to get a sense of whether I should pay the ECC tax for my next build, noting of course that as a consumer I will probably never notice a flipped bit.

lazide 1018 days ago

You’d ‘notice’ the flipped bits usually as rare, random, and impossible to reproduce crashes and lockups with the occasional data corruption.

Which is often background noise for home users, but no less problematic.

Often heat/load dependent too.

Keyframe 1018 days ago

I actually look forward to the (promised) future where "disk" storage is fast enough not to need RAM anymore.

brookst 1018 days ago

The convergence of volatile and non-volatile storage is one of the most exciting upcoming technologies, and always will be.

Keyframe 1018 days ago

Yeah, it's a bit like fusion, but for computing: always 50 years out. Some day! Maybe.

ls612 1018 days ago

With the failure of Optane I doubt that it will be coming anytime soon.

undersuit 1018 days ago

The merging of CXL and NVMe is just one frustrated vendor away.

bheadmaster 1018 days ago

Runtime asserts and invariant checks in software can also help a lot with isolating bitflip errors, with the nice addition of also isolating the effects of software bugs.
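
A minimal sketch of such a check: keep a CRC next to a buffer and verify it on every read. `GuardedBuffer` and `CorruptionDetected` are hypothetical names for illustration. Note that this only detects a change; it cannot repair the data, and it cannot tell a hardware bit flip from a stray write by buggy code.

```python
import zlib


class CorruptionDetected(RuntimeError):
    """Raised when stored data no longer matches its recorded checksum."""


class GuardedBuffer:
    """Holds bytes alongside a CRC-32 and verifies it on each read."""

    def __init__(self, data: bytes) -> None:
        self._data = bytearray(data)
        self._crc = zlib.crc32(self._data)

    def read(self) -> bytes:
        # Invariant: the contents must still hash to the checksum taken at write time.
        if zlib.crc32(self._data) != self._crc:
            raise CorruptionDetected("buffer contents changed since they were written")
        return bytes(self._data)


if __name__ == "__main__":
    buf = GuardedBuffer(b"decompressed image data")
    buf._data[3] ^= 0x10  # simulate a single bit flip in memory
    try:
        buf.read()
    except CorruptionDetected as exc:
        print("detected:", exc)
```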

GuB-42 1018 days ago

I don't know if it is significant. Runtime checks tend to focus on small but critical parts of the data, like size fields. They usually don't check bulk data, like decompressed image data, or code, and they may also not be effective if the data is in cache. Furthermore, they will only detect errors, not correct them. Also the performance cost is, I think, much higher than the extra RAM chip. It's good coding practice for critical paths in software, but clearly it doesn't substitute for dedicated hardware.

I have had defective RAM, and I got quite a bit of corruption before the first crashes. It is hardly noticeable when it is just a pixel changing color in a picture, but it is still something you don't want. ECC would have prevented that.

I know there is software resistant to random bitflips, like for satellites exposed to cosmic rays, but it is a highly specialized field. It is also a field where they use special chips, typically with coarser (and therefore less efficient) dies that are more resistant to radiation. You leave a lot on the table for that.

gumby 1018 days ago

ECC is better handled in hardware: most of the time it won’t happen, and the hardware can more easily interrupt the processor so the kernel can correct the problem or signal a fault if it’s not a correctable corruption.

lazide 1018 days ago

Those only help isolate somewhat predictable errors, which are rare among what ECC is designed to protect against.

If it’s a random, once-in-several-billion reads/writes issue, they can only identify the bad data and stop it from propagating further. Sometimes. That data is still lost.

ECC does forward error correction, which is extremely rare for the type of data protection you’re talking about. And if the data is corrupted in RAM (say when initially loaded/read) before the software can apply FEC, there is nothing the software can do.

ElectricalUnion 1018 days ago

I thought that the current wave of compiler correctness checking, zero-cost abstractions, JIT compilers and speculative processor behaviour were all about removing those "unnecessary" runtime asserts and invariant checks to get better performance.

mastax 1018 days ago

Assuming the compiler doesn't optimize them out.

ikekkdcjkfke 1018 days ago

All DDR5 has ECC.

pixl97 1018 days ago

But from my understanding it does not have a means of reporting ECC triggers to the user, which is really one of the most important parts.

When ECC starts tripping on a device at anything other than completely random times, you should look into what's going wrong. You may have overheating or failing hardware.

drzaiusapelord 1018 days ago

Wikipedia: Unlike DDR4, all DDR5 chips have on-die ECC, where errors are detected and corrected before sending data to the CPU. This, however, is not the same as true ECC memory with extra data correction chips on the memory module.

So I'm not sure how this works, because I'm not sure if "true" ECC is better/worse/same as on-die ECC. A casual googling shows on-die to have more advantages.
