by gchamonlive 1119 days ago
It means anyone launching AMD-powered virtual machines on cloud providers could hit this at any point, with no way of knowing when it will happen, since this type of CPU could have been bought, booted, or rebooted at any time in the past three years.

Seems comparably problematic to me.

6 comments

This depends on the C6 sleep state being enabled and the server not having been restarted for 3 years. It’s extremely unlikely that your cloud provider’s servers are going to meet these criteria, or that the provider will ignore this erratum for a part they’ve bought thousands of.

So no, it’s not going to start randomly hitting people.

> Seems comparably problematic to me.

Not even close. The FDIV bug hit common operations that could be issued millions of times per second. This bug only applies to specific configurations that haven’t been rebooted for 3 years, and it has a clear workaround.

They’re not even close to comparable in impact and ability to work around. Literally many orders of magnitude different.

Thanks for explaining the issue more! I must say I wasn't too familiar with Intel's issue when I wrote the comment.
The cloud providers now know of this bug. They will live-migrate you to a different machine, or shut down and reboot. Only on-premises deployments will have this issue.
They won't though. An EC2 instance stays on the same server even if its service is degraded, afaik.
Which CSPs do live migration?
And it works miraculously well. I have seen a very large Oracle DB with many reads and many updates migrate with no effects beyond a big spike in SQL execution times for the few hundred milliseconds between the source pause and the destination resume. I've seen various GCP bugs, mostly in VPN and Pub/Sub, but never anything from migration. They migrate just about all the instances every two weeks, so it is an often-exercised path.
I’d prefer a bug that crashes a program to one that quietly inserts wrong data and keeps going.
Which is the bigger deal probably depends on your workload. The FDIV bug was pretty bad, but at least fixable in software (at some cost). Anyway, a recall is the right decision in either case (unless there’s a good enough workaround).
If every CPU with an erratum that needed software workarounds were recalled, there would be no CPUs to use.
That’s why I said “unless there are good enough workarounds.” You buy a part with some performance/power consumption expectations.

It sounds like a workaround here could be to disable C6 sleep, so I guess we’ll see how much that violates those expectations. I guess they didn’t add the feature for no reason, though.

The other workaround is to reboot at least once every 3 years, which surely most users are doing anyway to pick up security patches and the like.

Exceptions definitely exist, but the workarounds are both pretty straightforward and you can pick whichever is less impactful.
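
For anyone who would rather track the reboot workaround than disable C6, here is a minimal sketch of an uptime check. It is not an official tool, and it makes several assumptions: the host is Linux and exposes `/proc/uptime`; the threshold is taken as 3 × 365 days because the thread only says "3 years" (check the vendor erratum for the real figure); the 90-day warning margin and all function names are hypothetical.

```python
from pathlib import Path

SECONDS_PER_DAY = 86_400
# Assumption: the thread only says "3 years", so 3 * 365 days is a stand-in.
THRESHOLD_DAYS = 3 * 365
# Hypothetical safety margin: start flagging hosts this many days early.
WARN_MARGIN_DAYS = 90


def days_until_threshold(uptime_seconds, threshold_days=THRESHOLD_DAYS):
    """Days of uptime left before the host reaches the threshold."""
    return threshold_days - uptime_seconds / SECONDS_PER_DAY


def needs_reboot(uptime_seconds, threshold_days=THRESHOLD_DAYS,
                 margin_days=WARN_MARGIN_DAYS):
    """True once the host is within margin_days of the threshold."""
    return days_until_threshold(uptime_seconds, threshold_days) <= margin_days


def read_uptime_seconds(path="/proc/uptime"):
    """Linux only: the first field of /proc/uptime is seconds since boot."""
    return float(Path(path).read_text().split()[0])
```

On a Linux host, `needs_reboot(read_uptime_seconds())` could be run from a periodic job to flag machines that should be scheduled for a maintenance reboot.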

It is a non-issue. If you need 3 years of permanent uptime, then what exactly do you need a deep sleep state for, when it is basically the same as turning the CPU off?
Me too, that is why I said the problems are comparable, not the same.
> amd powered virtual machines on cloud providers

Cloud providers are very unlikely to use sleep states. I mean, it's possible... but I'd bet against it.

I have to admit I wasn't aware that sleep states were necessary for the problem to arise when I wrote the comment.
I think 30 seconds of downtime over 3 years probably isn't that much of an issue for anybody. Floating point calculations being wrong, though... that's a bigger problem.
Server hardware routinely takes longer than 30 seconds to boot up, sometimes just to wait for power to stabilize in case it matters, sometimes to do a staggered spin-up of HDDs so a current spike doesn't overload something (it sounds cool!)
Google already offers preemptible VMs, where an instance can go down if its capacity is needed elsewhere. It's something you can design your services to handle easily, if you aren't already doing so.

Why wouldn't cloud providers be aware of how long a specific CPU has been up and plan around it? Also, do cloud providers generally never reboot their systems?

In AWS, if you keep a long-running VM, it will keep running on the same server, even if degraded, afaik. Even a reboot won't migrate it to a new server. You have to shut it down and power it back up. I learned this back in 2019, so it could have changed, but I doubt it.
GCP also does live migration for standard instances.