| HN Mirror


by Waterluvian 1117 days ago

It feels a little bit different.

One creates uncertainty in all floating point results, given you don’t know when it happens. The other requires you to reboot maybe every ~3 years and you know exactly when it happens.

I’m not saying we should tolerate a defect, but it doesn’t feel nearly as problematic.

4 comments

arp242 1117 days ago

It also has a fairly easy solution: disable the CC6 sleep state. The practical effects from that will most likely be minimal or non-existent for most users of these CPUs.

bioemerl 1117 days ago

> disable the CC6 sleep state.

This is now the second time AMD has screwed up the C6 state. Ryzen first gen would hang daily for me due to a similar bug.

sidewndr46 1117 days ago

I don't understand the nature of the relationship between a motherboard manufacturer and AMD but when I got my MSI Tomahawk board for my Ryzen I really thought I was losing my mind. I would have USB devices stop working at the most random of times and some of them would continually cycle between connected and not connected.

A motherboard update from MSI applied something from AMD and that fixed the issue.

allenrb 1117 days ago

They’ve improved by three decimal orders of magnitude since then. How much more can we ask of them?

touisteur 1117 days ago

I can already see the pain of the myriads of compliance (to all energy reduction directives, at least in EU) people getting strangely obtuse notes from their sw/hw/platform teams, saying in essence, errrrr we need to amend our already thick justification folder, to disable a specific sleep state. I feel a migraine (or a kind of sketch) coming. 'oh and BTW we're field upgrading the whole fleet'.

I guess fighting tooth and nail to disable any and all of these sleep states from the get go is worth it...

paulryanrogers 1117 days ago

Would this qualify as more CPU errata?

touisteur 1117 days ago

There's errata and errata...

As a systems seller you get most of the markup but also most of the responsibility, so handwaving 'sorry AMD fucked up' won't do it. You now have an installed base that might crash every 1024 days, which for unattended systems is long but not that long. Worse, if you have hardware redundancy, there's still a chance they all booted around the same time so will crash around the same time.

Customers will be proactive and follow the intelligent periodic reboot schedule you propose for a time (see the 787 overflow bugs stories), while asking for a fix. The fix needs to still be OK with all the specs you sold. If one of these specs depends on sleep states, you'll have to find a solution around it and deploy it fleetwide. If a microcode update fixes it, yay. If the problem can't be winked away with a software patch, now the blast radius is bigger and you're still supposed to do as much as possible to use the least energy possible in most idle states...

gchamonlive 1117 days ago

It means anyone launching AMD-powered virtual machines on cloud providers can experience this now, at any point, and you don't know when it will happen, given this type of CPU could have been bought, booted or rebooted anytime in the past three years.

Seems comparably problematic to me.

PragmaticPulp 1117 days ago

This depends on the C6 sleep state being enabled and the server not having been restarted for 3 years. It’s extremely unlikely that your cloud provider's servers are going to meet these criteria while the provider ignores this erratum for a part they’ve bought thousands of.

So no, it’s not going to start randomly hitting people.

> Seems comparably problematic to me.

Not even close. The FDIV bug hit common operations that could be issued millions of times per second. This bug only applies to specific configurations that haven’t been rebooted for 3 years and has a clear workaround.

They’re not even close to comparable in impact and ability to work around. Literally many orders of magnitude different.

gchamonlive 1117 days ago

Thanks for explaining the issue more! I must say I wasn't too familiar with Intel's issue when I wrote the comment.

tpetry 1117 days ago

The cloud providers now know of this bug. They will live migrate you to a different machine or shut down and reboot. Only on-premise will have this issue.

gchamonlive 1117 days ago

They won't though. An EC2 stays in the same server even if its service is degraded, afaik.

foobiekr 1117 days ago

Which CSPs do live migration?

nielsole 1117 days ago

https://cloud.google.com/compute/docs/instances/live-migrati...

Not sure about others

lanstin 1117 days ago

And it works miraculously well. I have seen a very large Oracle DB with many reads and many updates migrate with no effects outside of a big spike in SQL execution times for the few hundred milliseconds it takes between the source pause and the destination resume. Seen various GCP bugs, mostly VPN and Pub/Sub, but never anything from migration. They migrate about all the instances every two weeks, so it is an often exercised path.

Waterluvian 1117 days ago

I’d prefer a bug that crashes a program than one that quietly inserts wrong data and keeps going.

bee_rider 1117 days ago

Which is a bigger deal probably depends on your workload. The fdiv bug was pretty bad, but at least fixable in software (at some cost). Anyway, recall is the right decision in either case (unless there’s a good enough workaround).

__alexs 1117 days ago

If every CPU with an erratum that needed software workarounds was recalled there would be no CPUs to use.

bee_rider 1117 days ago

That’s why I said “unless there are good enough workarounds.” You buy a part with some performance/power consumption expectations.

It sounds like a workaround here could be to disable C6 sleep, so I guess we’ll see how much that violates those expectations. I guess they didn’t add the feature for no reason, though.

kllrnohj 1117 days ago

The other workaround is to reboot at least once every 3 years, which surely most users are doing anyway to pick up on security patches & similar.

Exceptions definitely exist, but the workarounds are both pretty straightforward and you can pick whichever is less impactful.
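A minimal sketch of how the reboot workaround could be checked on a fleet, assuming Linux hosts where `/proc/uptime` is readable. The 1024-day limit is the figure quoted elsewhere in this thread, not a vendor-confirmed value, and the 90-day margin and both function names are hypothetical choices for illustration.

```python
from pathlib import Path

SECONDS_PER_DAY = 86_400
HANG_AFTER_DAYS = 1024   # assumption: figure quoted in this thread
SAFETY_MARGIN_DAYS = 90  # assumption: arbitrary buffer for scheduling a reboot


def uptime_days(proc_uptime_text: str) -> float:
    """Convert the first field of /proc/uptime (seconds since boot) to days."""
    return float(proc_uptime_text.split()[0]) / SECONDS_PER_DAY


def reboot_due(days_up: float,
               limit: float = HANG_AFTER_DAYS,
               margin: float = SAFETY_MARGIN_DAYS) -> bool:
    """True once uptime is within `margin` days of the assumed limit."""
    return days_up >= limit - margin


if __name__ == "__main__":
    proc_uptime = Path("/proc/uptime")
    if proc_uptime.exists():
        days = uptime_days(proc_uptime.read_text())
        print(f"up {days:.1f} days; reboot due: {reboot_due(days)}")
```

This only covers the "reboot before the limit" option; disabling the sleep state is a firmware or kernel setting and is not shown here.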

imtringued 1117 days ago

It is a non-issue. If you need 3 years of permanent uptime, then what exactly do you need a deep sleep state for that is basically the same as turning the CPU off?

gchamonlive 1117 days ago

Me too, that is why I said the problems are comparable, not the same.

viraptor 1117 days ago

> amd powered virtual machines on cloud providers

Cloud providers are very unlikely to use sleep states. I mean, it's possible... but I'd bet against it.

gchamonlive 1117 days ago

I have to admit I wasn't aware of sleep states being necessary for the problem to arise when I wrote the comment.

callamdelaney 1117 days ago

I think 30 seconds of downtime over 3 years probably isn't that much of an issue for anybody. Floating point calculations being wrong though.. that's a bigger problem.

numpad0 1117 days ago

Server hardware routinely takes longer than 30 seconds to boot up, sometimes just to wait for power to stabilize just in case it matters, sometimes to do a staggered spinup of HDDs to avoid a current spike overloading something (it sounds cool!)

Sakos 1117 days ago

Google already does preemptible VMs where an instance can go down if it's needed elsewhere. It's something you can design your services to handle easily, if you aren't already doing so.

Why wouldn't cloud providers be aware of how long a specific CPU has been up and plan around it? Also, do cloud providers generally never reboot their systems?

gchamonlive 1117 days ago

In AWS, if you keep a long running VM, it will keep running in the same server, even if degraded afaik. Even a reboot won't migrate to a new server. You have to shut it down and power it back up. This I learned back in 2019, so it could have changed but I doubt it.

p_l 1117 days ago

GCP also does live migration for standard instances.

0xr0kk3r 1117 days ago

To be fair, it was possible to tell what operations would be off in the FDIV bug and by how much. It was 100% deterministic. Problem was, checking all the operands in SW before performing the computation to make adjustments completely defeated the purpose of having an FPU.

KptMarchewa 1117 days ago

Especially compared to something like this: https://www.theregister.com/2020/04/02/boeing_787_power_cycl...