Hacker News
by AaronFriel 1990 days ago
It can't eliminate it, but:

1. Single bitflip correction, along with Google's metrics, could help them identify their own algorithms and customers' VMs that are causing bitflips via rowhammer, as well as machines that have errors regardless of workload.

2. Double bitflip detection lets Google decide whether they want to, say, panic at that point and take the machine out of service, and they can report on what software was running or why the error occurred. Their SREs are world-class and may be able to deduce whether this was a fluke (orders of magnitude less likely than a single bitflip), whether a workload caused it, or whether hardware caused it.
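
To make the difference between "correct one flip" and "detect two flips" concrete, here is a toy sketch of a SECDED code on 4 data bits (extended Hamming(8,4)). Real ECC DIMMs typically protect 64 data bits with 8 check bits and do this in hardware; the function names `encode` and `decode` and the bit layout are my own choices for illustration, not any vendor's implementation.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 if valid
    odd_parity = sum(code) % 2          # 1 means an odd number of bits flipped

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Single flip: the syndrome is the flipped position
        # (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble
```

The "corrected" branch is point 1 (the data survives, and you get an event worth logging), and the "uncorrectable" branch is point 2 (you know something is wrong but cannot fix it, so you decide whether to panic).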

The advantage the 3 major cloud providers have is scale. If a Fortune 500 company were running its own datacenters, how likely is it that it would have the same level of visibility into its workloads, the same quality of SREs to diagnose problems, and the sheer statistical power of scale?

I sincerely hope Google is not simply silencing bitflip corrections and detections. That would be a profound waste.


ECC seems like a trivial thing to log and keep track of. Surely any Fortune 500 could do it and would have enough scale to get meaningful data out of it?
It's not just tracking ECC errors, which as you point out is not hard, but correlating them with the other metrics needed to determine the cause, and having the scale to reliably root-cause bitflips to software (workloads that inadvertently rowhammer), to hardware, or even to malicious users (GCP customers that may intentionally run a rowhammer attack).
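
As a rough illustration of what "correlating" could mean here, the sketch below groups ECC error events by machine and by workload. Everything in it is hypothetical: the `attribute_errors` name, the `(machine_id, workload_id)` event shape, and the `min_spread` threshold are assumptions made for the example, not how Google or anyone else actually does it.

```python
from collections import defaultdict


def attribute_errors(events, min_spread=3):
    """Guess at likely culprits from ECC error events.

    events: iterable of (machine_id, workload_id) pairs, one per error.
    A machine that errors under many different workloads looks like bad
    hardware; a workload that errors on many different machines looks
    like software (for example, an access pattern that rowhammers).
    """
    workloads_by_machine = defaultdict(set)
    machines_by_workload = defaultdict(set)
    for machine, workload in events:
        workloads_by_machine[machine].add(workload)
        machines_by_workload[workload].add(machine)

    suspect_machines = sorted(
        m for m, ws in workloads_by_machine.items() if len(ws) >= min_spread
    )
    suspect_workloads = sorted(
        w for w, ms in machines_by_workload.items() if len(ms) >= min_spread
    )
    return suspect_machines, suspect_workloads
```

Note that this ignores base rates: a workload that runs on thousands of machines will pick up errors on several of them by chance. Normalizing for that is where the other metrics and the scale come in.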
IBM does. They will probably sell you the information if you rent the machines from them.