

	by gchamonlive 1119 days ago
	It means anyone launching AMD-powered virtual machines on cloud providers can experience this now, at any point, and you don't know when it will happen, given that this type of CPU could have been bought, booted or rebooted anytime in the past three years. Seems comparably problematic to me.

6 comments

PragmaticPulp 1119 days ago

This depends on the C6 sleep state being enabled and the server not having been restarted for 3 years. It’s extremely unlikely that your cloud provider's servers are going to meet these criteria and that the provider will ignore this erratum for a part they’ve bought thousands of.

So no, it’s not going to start randomly hitting people.

> Seems comparably problematic to me.

Not even close. The FDIV bug hit common operations that could be issued millions of times per second. This bug only applies to specific configurations that haven’t been rebooted for 3 years and has a clear workaround.

They’re not even close to comparable in impact and ability to work around. Literally many orders of magnitude different.

gchamonlive 1119 days ago

Thanks for explaining the issue more! I must say I wasn't too familiar with Intel's issue when I wrote the comment.

tpetry 1119 days ago

The cloud providers now know of this bug. They will live migrate you to a different machine, or shut down and reboot. Only on-premise deployments will have this issue.

gchamonlive 1119 days ago

They won't though. An EC2 instance stays on the same server even if its service is degraded, afaik.

foobiekr 1119 days ago

Which CSPs do live migration?

nielsole 1119 days ago

https://cloud.google.com/compute/docs/instances/live-migrati...

Not sure about others

lanstin 1119 days ago

And it works miraculously well. I have seen a very large Oracle DB with many reads and many updates migrate with no effects outside of a big spike in SQL execution times for the few hundred milliseconds it takes between the source pause and the destination resume. I've seen various GCP bugs, mostly VPN and Pub/Sub, but never anything from migration. They migrate just about all the instances every two weeks, so it is an often-exercised path.

Waterluvian 1119 days ago

I’d prefer a bug that crashes a program to one that quietly inserts wrong data and keeps going.

bee_rider 1119 days ago

Which is the bigger deal probably depends on your workload. The FDIV bug was pretty bad, but at least fixable in software (at some cost). Anyway, a recall is the right decision in either case (unless there’s a good enough workaround).

__alexs 1119 days ago

If every CPU with an erratum that needed software workarounds were recalled, there would be no CPUs to use.

bee_rider 1119 days ago

That’s why I said “unless there are good enough workarounds.” You buy a part with some performance/power consumption expectations.

It sounds like a workaround here could be to disable C6 sleep, so I guess we’ll see how much that violates those expectations. I guess they didn’t add the feature for no reason, though.

kllrnohj 1119 days ago

The other workaround is to reboot at least once every 3 years, which surely most users are doing anyway to pick up on security patches & similar.

Exceptions definitely exist, but the workarounds are both pretty straightforward and you can pick whichever is less impactful.
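
The reboot workaround above can be sketched as a small uptime check. This is a minimal sketch, not a vetted tool: the limit of `3 * 365` days is an assumption taken from the "3 years" figure used in this thread (check the vendor erratum for the exact number), and the 90-day warning margin and all function names are hypothetical. It assumes a Linux host where `/proc/uptime` is readable.

```python
from pathlib import Path

SECONDS_PER_DAY = 86_400
# Assumption: the thread's "3 years" rounded to days; the real limit may differ.
ASSUMED_LIMIT_DAYS = 3 * 365
# Hypothetical safety margin: start warning this many days before the limit.
WARN_MARGIN_DAYS = 90


def parse_uptime_seconds(text: str) -> float:
    """Return the first field of /proc/uptime-style text (seconds since boot)."""
    return float(text.split()[0])


def days_until_limit(uptime_seconds: float,
                     limit_days: int = ASSUMED_LIMIT_DAYS) -> float:
    """Days remaining before uptime reaches the assumed limit (negative if past it)."""
    return limit_days - uptime_seconds / SECONDS_PER_DAY


def needs_reboot_soon(uptime_seconds: float,
                      limit_days: int = ASSUMED_LIMIT_DAYS,
                      margin_days: int = WARN_MARGIN_DAYS) -> bool:
    """True once the remaining time is within the warning margin."""
    return days_until_limit(uptime_seconds, limit_days) <= margin_days


if __name__ == "__main__":
    # Linux-only: /proc/uptime holds uptime and idle time in seconds.
    uptime = parse_uptime_seconds(Path("/proc/uptime").read_text())
    remaining = days_until_limit(uptime)
    if needs_reboot_soon(uptime):
        print(f"Schedule a reboot: {remaining:.0f} days left before the assumed limit")
    else:
        print(f"OK: {remaining:.0f} days left before the assumed limit")
```

Run from cron or a monitoring agent, a check like this would turn the "reboot every 3 years" workaround into a scheduled maintenance task rather than something to remember.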

imtringued 1119 days ago

It is a non-issue. If you need 3 years of permanent uptime, then what exactly do you need a deep sleep state for that is basically the same as turning the CPU off?

gchamonlive 1119 days ago

Me too, that is why I said the problems are comparable, not the same.

viraptor 1119 days ago

> amd powered virtual machines on cloud providers

Cloud providers are very unlikely to use sleep states. I mean, it's possible... but I'd bet against it.

gchamonlive 1119 days ago

I have to admit I wasn't aware of sleep states being necessary for the problem to arise when I wrote the comment.

callamdelaney 1119 days ago

I think 30 seconds of downtime over 3 years probably isn't that much of an issue for anybody. Floating point calculations being wrong though.. that's a bigger problem.

numpad0 1119 days ago

Server hardware routinely takes longer than 30 seconds to boot up, sometimes just to wait for power to stabilize in case it matters, sometimes to do a staggered spin-up of HDDs to avoid a current spike overloading something (it sounds cool!).

Sakos 1119 days ago

Google already offers preemptible VMs, where an instance can go down if its capacity is needed elsewhere. It's something you can design your services to handle easily, if you aren't already doing so.

Why wouldn't cloud providers be aware of how long a specific CPU has been up and plan around it? Also, do cloud providers generally never reboot their systems?

gchamonlive 1119 days ago

In AWS, if you keep a long-running VM, it will keep running on the same server, even if degraded, afaik. Even a reboot won't migrate it to a new server. You have to shut it down and power it back up. I learned this back in 2019, so it could have changed, but I doubt it.

p_l 1119 days ago

GCP also does live migration for standard instances.