Hacker News
by Waterluvian 1117 days ago
It feels a little bit different.

One creates uncertainty in all floating point results, given you don’t know when it happens. The other requires you to reboot maybe every ~3 years and you know exactly when it happens.

I’m not saying we should tolerate a defect, but it doesn’t feel nearly as problematic.

4 comments

It also has a fairly easy solution: disable the CC6 sleep state. The practical effects from that will most likely be minimal or non-existent for most users of these CPUs.
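If you want to see which idle states your kernel exposes before disabling anything, here is a rough read-only sketch for Linux. It assumes the standard cpuidle sysfs layout (`/sys/devices/system/cpu/cpu0/cpuidle/stateN/` with `name` and `disable` files). The function name `list_idle_states` is made up for illustration. Which listed state corresponds to CC6 is platform-specific and not established in this thread, so check your vendor's documentation before touching anything.

```python
from pathlib import Path


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (state_dir, name, disabled) tuples for one CPU's idle states.

    Read-only: nothing is written to sysfs. Returns an empty list when
    the directory is missing (non-Linux host, or cpuidle not available).
    """
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []

    # Sort numerically so state10 comes after state2.
    state_dirs = sorted(
        base.glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )

    states = []
    for state in state_dirs:
        name_file = state / "name"
        disable_file = state / "disable"
        name = name_file.read_text().strip() if name_file.exists() else "?"
        disabled = (
            disable_file.exists() and disable_file.read_text().strip() == "1"
        )
        states.append((state.name, name, disabled))
    return states


if __name__ == "__main__":
    for state_dir, name, disabled in list_idle_states():
        print(f"{state_dir}: {name} ({'disabled' if disabled else 'enabled'})")
```

This only reports on cpu0; other CPUs have their own `cpuidle` directories and may be configured differently.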
> disable the CC6 sleep state.

This is now the second time AMD has screwed up the C6 state. Ryzen first gen would hang daily for me due to a similar bug.

I don't understand the nature of the relationship between a motherboard manufacturer and AMD but when I got my MSI Tomahawk board for my Ryzen I really thought I was losing my mind. I would have USB devices stop working at the most random of times and some of them would continually cycle between connected and not connected.

A motherboard update from MSI applied something from AMD and that fixed the issue.

They’ve improved by three decimal orders of magnitude since then. How much more can we ask of them?
I can already see the pain of the myriads of compliance (to all energy reduction directives, at least in EU) people getting strangely obtuse notes from their sw/hw/platform teams, saying in essence, errrrr we need to amend our already thick justification folder, to disable a specific sleep state. I feel a migraine (or a kind of sketch) coming. 'oh and BTW we're field upgrading the whole fleet'.

I guess fighting tooth and nail to disable any and all of these sleep states from the get go is worth it...

Would this qualify as more CPU errata?
There's errata and errata...

As a systems seller you get most of the markup but also most of the responsibility, so handwaving 'sorry AMD fucked up' won't do it. You now have an installed base that might crash every 1024 days, which for unattended systems is long but not that long. Worse, if you have hardware redundancy, there's still a chance the redundant units all booted around the same time and so will crash around the same time.
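To make the redundancy worry concrete, here is a rough sketch that takes boot times for a fleet and flags hosts whose projected hang dates land close together. The 1024-day figure is just the number quoted in this comment (check the vendor erratum for the real value), and the host names, the one-day window, and the function names are all hypothetical.

```python
from datetime import datetime, timedelta

# Figure quoted in this thread; treat it as an assumption, not a spec.
HANG_AFTER_DAYS = 1024


def hang_deadline(boot_time, limit_days=HANG_AFTER_DAYS):
    """Projected moment the uptime limit is reached for one host."""
    return boot_time + timedelta(days=limit_days)


def clustered_deadlines(boot_times, window=timedelta(days=1),
                        limit_days=HANG_AFTER_DAYS):
    """Group hosts whose consecutive deadlines are at most `window` apart.

    `boot_times` maps host name -> datetime of last boot. Only groups
    with two or more hosts are returned, since a lone host cannot take
    out its redundant partners at the same time.
    """
    items = sorted(
        (hang_deadline(t, limit_days), host) for host, t in boot_times.items()
    )
    groups, current = [], []
    for deadline, host in items:
        # Start a new group when the gap to the previous deadline is too big.
        if current and deadline - current[-1][0] > window:
            if len(current) > 1:
                groups.append([h for _, h in current])
            current = []
        current.append((deadline, host))
    if len(current) > 1:
        groups.append([h for _, h in current])
    return groups


if __name__ == "__main__":
    fleet = {
        "node-a": datetime(2021, 1, 1, 0, 0),
        "node-b": datetime(2021, 1, 1, 0, 5),
        "node-c": datetime(2021, 6, 1, 0, 0),
    }
    for group in clustered_deadlines(fleet):
        print("reboot these at different times:", ", ".join(group))
```

Any group it reports is a candidate for staggered reboots, so the redundant units stop sharing a deadline.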

Customers will be proactive and follow the intelligent periodic reboot schedule you propose for a time (see the 787 overflow bugs stories), while asking for a fix. The fix needs to still be OK with all the specs you sold. If one of these specs depends on sleep states, you'll have to find a solution around it and deploy it fleetwide. If a microcode update fixes it, yay. If the problem can't be winked away with a software patch, now the blast radius is bigger and you're still supposed to do as much as possible to use the least energy possible in most idle states...

It means anyone launching amd powered virtual machines on cloud providers can experience this now, at any point, and you don't know when it will happen, given this type of CPU could have been bought, booted or rebooted anytime in the past three years.

Seems comparably problematic to me.

This depends on the C6 sleep state being enabled and the server not having been restarted for 3 years. It’s extremely unlikely that your cloud provider's servers are going to meet these criteria, or that the provider will ignore this erratum for a part they’ve bought thousands of.

So no, it’s not going to start randomly hitting people.

> Seems comparably problematic to me.

Not even close. The FDIV bug hit common operations that could be issued millions of times per second. This bug only applies to specific configurations that haven’t been rebooted for 3 years and has a clear workaround.

They’re not even close to comparable in impact and ability to work around. Literally many orders of magnitude different.

Thanks for explaining the issue more! I must say I wasn't too familiar with Intel's issue when I wrote the comment.
The cloud providers now know of this bug. They will live migrate you to a different machine or shut down and reboot. Only on-premise will have this issue.
They won't though. An EC2 instance stays on the same server even if its service is degraded, afaik.
Which CSPs do live migration?
And it works miraculously well. I have seen a very large Oracle DB with many reads and many updates migrate with no effects outside of a big spike in SQL execution times for the few hundred milliseconds it takes between the source pause and the destination resume. Seen various GCP bugs, mostly VPN and Pub/Sub, but never anything from migration. They migrate just about all the instances every two weeks, so it is an often exercised path.
I’d prefer a bug that crashes a program over one that quietly inserts wrong data and keeps going.
It probably depends on your workload, which is a bigger deal. The fdiv bug was pretty bad, but at least fixable in software (at some cost). Anyway, recall is the right decision in either case (unless there’s a good enough workaround).
If every CPU with an erratum that needed software workarounds were recalled, there would be no CPUs to use.
That’s why I said “unless there are good enough workarounds.” You buy a part with some performance/power consumption expectations.

It sounds like a workaround here could be to disable C6 sleep, so I guess we’ll see how much that violates those expectations. I guess they didn’t add the feature for no reason, though.

The other workaround is to reboot at least once every 3 years, which surely most users are doing anyway to pick up on security patches & similar.

Exceptions definitely exist, but the workarounds are both pretty straightforward and you can pick whichever is less impactful.

It is a non-issue. If you need 3 years of permanent uptime, then what exactly do you need a deep sleep state for that is basically the same as turning the CPU off?
Me too, that is why I said the problems are comparable, not the same.
> amd powered virtual machines on cloud providers

Cloud providers are very unlikely to use sleep states. I mean, it's possible... but I'd bet against it.

I have to admit I wasn't aware of sleep states being necessary for the problem to arise when I wrote the comment.
I think 30 seconds of downtime over 3 years probably isn't that much of an issue for anybody. Floating point calculations being wrong though.. that's a bigger problem.
Server hardware routinely takes longer than 30 seconds to boot up, sometimes just to wait for power to stabilize in case it matters, sometimes to do a staggered spin-up of HDDs to avoid a current spike overloading something (it sounds cool!)
Google already does preemptive VMs where an instance can go down if it's needed elsewhere. It's something you can design your services to handle easily, if you aren't already doing so.

Why wouldn't cloud providers be aware of how long a specific CPU has been up and plan around it? Also, do cloud providers generally never reboot their systems?

In AWS, if you keep a long running VM, it will keep running on the same server, even if degraded, afaik. Even a reboot won't migrate it to a new server. You have to shut it down and power it back up. This I learned back in 2019, so it could have changed but I doubt it.
GCP also does live migration for standard instances.
To be fair, it was possible to tell what operations would be off in the FDIV bug and by how much. It was 100% deterministic. Problem was, checking all the operands in SW before performing the computation to make adjustments completely defeated the purpose of having an FPU.
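For reference, the determinism is easy to show with the division that was widely circulated as a test at the time: 4195835 / 3145727. On a correct FPU, multiplying the quotient back gives the original number, so the residual is zero; affected Pentiums were reported to give a residual of 256. A minimal sketch follows, with function names invented for illustration. On any modern machine it will simply report that the FPU looks fine.

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Return x - (x / y) * y, which should be 0 on a correct FPU.

    The default operands are the commonly cited FDIV test pair; affected
    Pentium chips were reported to return 256 for this expression.
    """
    return x - (x / y) * y


def looks_flawed(tolerance=1e-6):
    """True if the residual for the test pair is noticeably non-zero."""
    return abs(fdiv_residual()) > tolerance


if __name__ == "__main__":
    print("quotient:", 4195835.0 / 3145727.0)
    print("residual:", fdiv_residual())
    print("FPU looks flawed" if looks_flawed() else "FPU looks fine")
```

A one-off check like this is cheap. The cost described above comes from having to screen operands before every division in a hot loop.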
Especially compared to something like this: https://www.theregister.com/2020/04/02/boeing_787_power_cycl...