| HN Mirror


	by uiohnuipb 6105 days ago
	And try to convince a programmer that it's possible that their program's memory can be wrong. They understand in theory but refuse to code for the possibility. Especially when you get into HPC and there are clusters of 50-60 machines with 4 GB each, the chance of not having corrupt memory is almost 0.

2 comments

jbellis 6105 days ago

> And try and convince a programmer that it's possible that their program's memory can be wrong. They understand in theory but refuse to code for the possibility.

Because the hardware can still detect multi-bit errors, just not transparently correct them. So you shut the machine down automatically until you get new DRAM installed.

Programmers _are_ coding for machines-will-fail-temporarily, but coding to "handle" random memory errors instead of buying the right ECC hardware would be insane.

Psyonic 6105 days ago

I'm honestly curious... what kind of defensive programming techniques could you use to try and deal with this?

dmm 6105 days ago

Do everything twice and make sure the results match.
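
A minimal Python sketch of this idea, assuming the work is a pure function of its inputs (same arguments, same result). `run_twice` is a hypothetical helper name, not something from the thread. Note that both runs here share the same machine and memory, so this only catches errors that affect one run and not the other:

```python
def run_twice(fn, *args):
    """Run fn twice on the same inputs and fail loudly if the results differ."""
    first = fn(*args)
    second = fn(*args)
    if first != second:
        # A mismatch means at least one run was wrong; we cannot tell which.
        raise RuntimeError("results differ between runs; possible memory corruption")
    return first


print(run_twice(sum, [1, 2, 3]))  # 6
```

On a mismatch the sketch only reports the problem; deciding which result to trust needs more than two runs.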

pyre 6105 days ago

What happens when the code doing the comparison becomes corrupted? Do the comparison twice? What happens when the code controlling the evaluation of both comparisons becomes corrupted?

Your data and your instructions are in the same memory. Even if they are separated into different areas of memory to prevent buffer overflow exploits, it's all still in memory. Once the memory starts going, you're kind of screwed. It's the same as how, with respect to computer security, once someone has physical access to the machine, you're screwed.

With respect to memory errors in distributed environments, usually such environments are distributed to increase the processing power for number crunching. If you run all calculations twice and have code comparing them for acceptance, you're more than doubling your processing requirements.

But at the end of the day, it's all a matter of what level of risk is acceptable (or tolerable). There is no magic bullet to fix these issues.

wmf 6105 days ago

You're ignoring that the voting code would be a very small fraction of your RAM and thus less likely to be corrupted. But it's academic since no one runs twice to avoid the cost of ECC.

pyre 6105 days ago

I realize that. My point is that there is no 100% solution.

ppereira 6105 days ago

The IBM z9 processor, if I remember correctly, can do the comparison in hardware. When the cores consistently fail to match, the system can probably also call the IBM service technician.

moe 6105 days ago

The z-series stuff is indeed some amazing piece of kit.

Yes, the machine can and does call the technician when something fails. And iirc everything, including the CPUs, is hot-swappable. That means you can physically remove a CPU-book (containing processors and RAM) and your OS will keep running.

Quite a nerd's dream, if you have the spare change...

gnaritas 6105 days ago

Yeah, programmers don't do that because in the vast majority of programs written, this would be absurd, costly, and not at all worth the effort.

mey 6105 days ago

To be safe you need to do everything 3 times and call a vote. Typically with 3 complete systems of identical nature.
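
A rough Python sketch of the voting step, assuming the three results are hashable values collected from three independent systems. `majority_vote` is a hypothetical name used for illustration:

```python
from collections import Counter


def majority_vote(results):
    """Return the value that a strict majority of results agree on."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No value appears in more than half the results, so nothing can be trusted.
        raise RuntimeError("no majority among results")
    return value


print(majority_vote([42, 42, 41]))  # 42
```

With three results, one corrupted run is outvoted by the other two; if all three disagree, the sketch gives up rather than guess.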

pyre 6105 days ago

Well, once you've done it twice and the results don't match you'll probably re-run it a third time. It wouldn't make sense to just choose the result that 'makes sense' at that point.

btilly 6105 days ago

OK, you run it twice and the results differ. You go back and look at what the starting state should be and the starting states differ. Where do you get the definitely correct data from for the third run?
