Hacker News
by RobotToaster 1117 days ago
I feel like some of the comments here are missing the point. Yes, it's only likely to affect a small number of users, but so was the Intel FDIV bug; both are defective products.

Back then intel were pressured into a recall, today we seem too willing to put up with being sold broken stuff.

It feels a little bit different.

One creates uncertainty in all floating point results, given you don’t know when it happens. The other requires you to reboot maybe every ~3 years and you know exactly when it happens.

I’m not saying we should tolerate a defect, but it doesn’t feel nearly as problematic.

It also has a fairly easy solution: disable the CC6 sleep state. The practical effects from that will most likely be minimal or non-existent for most users of these CPUs.
> disable the CC6 sleep state.

This is now the second time AMD has screwed up the C6 state. Ryzen first gen would hang daily for me due to a similar bug.

I don't understand the nature of the relationship between a motherboard manufacturer and AMD but when I got my MSI Tomahawk board for my Ryzen I really thought I was losing my mind. I would have USB devices stop working at the most random of times and some of them would continually cycle between connected and not connected.

A motherboard update from MSI applied something from AMD and that fixed the issue.

They’ve improved by three decimal orders of magnitude since then. How much more can we ask of them?
I can already see the pain of the myriad compliance people (for all the energy reduction directives, at least in the EU) getting strangely obtuse notes from their sw/hw/platform teams saying, in essence, 'errrrr, we need to amend our already thick justification folder to disable a specific sleep state.' I feel a migraine (or a kind of sketch) coming. 'Oh, and BTW, we're field upgrading the whole fleet.'

I guess fighting tooth and nail to disable any and all of these sleep states from the get go is worth it...

Would this qualify as more CPU errata?
There's errata and errata...

As a systems seller you get most of the markup but also most of the responsibility, so handwaving 'sorry AMD fucked up' won't do it. You now have an installed base that might crash every 1024 days, which for unattended systems is long but not that long. Worse, if you have hardware redundancy, there's still a chance the machines all booted around the same time and so will crash around the same time.

Customers will be proactive and follow the intelligent periodic reboot schedule you propose for a time (see the 787 overflow bugs stories), while asking for a fix. The fix needs to still be OK with all the specs you sold. If one of these specs depends on sleep states, you'll have to find a solution around it and deploy it fleetwide. If a microcode update fixes it, yay. If the problem can't be winked away with a software patch, now the blast radius is bigger and you're still supposed to do as much as possible to use the least energy possible in most idle states...
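
For what it's worth, the "booted around the same time" risk above can be reduced by giving each redundant host its own reboot slot. A minimal sketch, assuming hypothetical host names and an arbitrary 24-hour spacing (neither comes from AMD or any vendor guidance):

```python
from datetime import datetime, timedelta


def staggered_reboot_plan(hosts, first_reboot, spacing_hours=24):
    """Give every host its own reboot slot, spacing_hours apart.

    The point is only that redundant peers should not share a boot time,
    so their uptime counters never reach the limit together.
    spacing_hours is an illustrative choice, not a recommendation.
    """
    step = timedelta(hours=spacing_hours)
    return {host: first_reboot + i * step for i, host in enumerate(hosts)}


# Hypothetical three-node redundant group.
plan = staggered_reboot_plan(
    ["node-a", "node-b", "node-c"],
    datetime(2024, 1, 1, 3, 0),
)
for host, when in plan.items():
    print(host, when.isoformat())
```

Any spacing larger than the time it takes one node to come back up would do; the fixed step just keeps the sketch simple.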

It means anyone launching amd powered virtual machines on cloud providers can experience this now, at any point, and you don't know when it will happen, given this type of CPU could have been bought, booted or rebooted anytime in the past three years.

Seems comparably problematic to me.

This depends on the C6 sleep state being enabled and the server not having been restarted for 3 years. It’s extremely unlikely that your cloud provider’s servers are going to meet these criteria and that the provider will ignore this erratum for a part they’ve bought thousands of.

So no, it’s not going to start randomly hitting people.

> Seems comparably problematic to me.

Not even close. The FDIV bug hit common operations that could be issued millions of times per second. This bug only applies to specific configurations that haven’t been rebooted for 3 years and has a clear workaround.

They’re not even close to comparable in impact and ability to work around. Literally many orders of magnitude different.

Thanks for explaining the issue more! I must say I wasn't too familiar with Intel's issue when I wrote the comment.
The cloud providers now know of this bug. They will live migrate you to a different machine or shut down and reboot. Only on-premise will have this issue.
They won't though. An EC2 instance stays on the same server even if its service is degraded, afaik.
Which CSPs do live migration?
And it works miraculously well. I have seen a very large Oracle DB with many reads and many updates migrate with no effects outside of a big spike in SQL execution times for the few hundred milliseconds between the source pause and the destination resume. I've seen various GCP bugs, mostly in VPN and Pub/Sub, but never anything from migration. They migrate just about all the instances every two weeks, so it is an often exercised path.
I’d prefer a bug that crashes a program to one that quietly inserts wrong data and keeps going.
Which is the bigger deal probably depends on your workload. The fdiv bug was pretty bad, but at least fixable in software (at some cost). Anyway, recall is the right decision in either case (unless there’s a good enough workaround).
If every CPU with an erratum that needed software workarounds were recalled there would be no CPUs to use.
That’s why I said “unless there are good enough workarounds.” You buy a part with some performance/power consumption expectations.

It sounds like a workaround here could be to disable C6 sleep, so I guess we’ll see how much that violates those expectations. I guess they didn’t add the feature for no reason, though.

Me too, that is why I said the problems are comparable, not the same.
> amd powered virtual machines on cloud providers

Cloud providers are very unlikely to use sleep states. I mean, it's possible... but I'd bet against it.

I have to admit I wasn't aware of sleep states being necessary for the problem to arise when I wrote the comment.
I think 30 seconds of downtime over 3 years probably isn't that much of an issue for anybody. Floating point calculations being wrong though.. that's a bigger problem.
Server hardware routinely takes longer than 30 seconds to boot up, sometimes just waiting for power to stabilize in case it matters, sometimes doing a staggered spin-up of HDDs to avoid a current spike overloading something (it sounds cool!).
Google already does preemptive VMs where an instance can go down if it's needed elsewhere. It's something you can design your services to handle easily, if you aren't already doing so.

Why wouldn't cloud providers be aware of how long a specific CPU has been up and plan around it? Also, do cloud providers generally never reboot their systems?

In AWS, if you keep a long running VM, it will keep running in the same server, even if degraded afaik. Even a reboot won't migrate to a new server. You have to shut it down and power it back up. This I learned back in 2019, so it could have changed but I doubt it.
GCP also does live migration for standard instances.
To be fair, it was possible to tell what operations would be off in the FDIV bug and by how much. It was 100% deterministic. Problem was, checking all the operands in SW before performing the computation to make adjustments completely defeated the purpose of having an FPU.
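
To illustrate the "100% deterministic" part: the operand pair below is the one commonly cited in write-ups of the FDIV bug. This is a minimal sketch that checks only that single pair (it is not the per-operand software screening described above), and the function names and tolerance are my own choices.

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Return x - (x / y) * y.

    On a correct FPU this is zero up to floating point rounding.
    A flawed Pentium reportedly produced a visibly wrong quotient
    for this pair, which made the residual large.
    """
    return x - (x / y) * y


def looks_like_fdiv_bug(tolerance=1e-6):
    """True if the residual is far larger than rounding error allows.

    tolerance is an arbitrary cutoff well above double precision
    rounding error for operands of this size.
    """
    return abs(fdiv_residual()) > tolerance


print("residual:", fdiv_residual())
print("suspicious:", looks_like_fdiv_bug())
```

Because the same operands always give the same wrong answer, a fixed probe like this is enough to detect an affected part, even though screening every division at run time was too expensive.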
Especially compared to something like this: https://www.theregister.com/2020/04/02/boeing_787_power_cycl...
There are always bugs in silicon, just like there are bugs in software. They mostly show up under "a highly specific and detailed set of internal timing conditions". There are 40 documented errata on EPYC 7002s alone; there are 35 in the 13th-gen Intel CPUs, including, curiously, RPL038, "Processor Exiting Package C6 or C8 May Hang". Mobile ARM chip manufacturers are notoriously bad at documenting their bugs, so who knows how many they have.

This one is interesting because its preconditions are so trivial, and it will affect many more people than usual.

> Back then intel were pressured into a recall, today we seem too willing to put up with being sold broken stuff.

This bug only applies to servers that haven’t been rebooted for 3 years and have the CC6 sleep state enabled. It can be worked around by disabling CC6 sleep state or rebooting once every 3 years.
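
Since the failure point is known in advance, checking how close a box is takes a few lines. A rough sketch, assuming Linux (`/proc/uptime`) and using the 1024-day figure quoted elsewhere in this thread as the default limit; that number is an assumption here, so check the vendor erratum for the real one.

```python
import os


def days_until_limit(uptime_seconds, limit_days=1024.0):
    """Days left before uptime reaches limit_days (negative if past it).

    limit_days defaults to the figure quoted in this thread, which is
    an assumption and not taken from the vendor's documentation.
    """
    return limit_days - uptime_seconds / 86400.0


def read_uptime_seconds(path="/proc/uptime"):
    """Linux only: the first field of /proc/uptime is seconds since boot."""
    with open(path) as f:
        return float(f.read().split()[0])


if __name__ == "__main__" and os.path.exists("/proc/uptime"):
    remaining = days_until_limit(read_uptime_seconds())
    print(f"{remaining:.1f} days until the assumed limit")
```

Feeding the result into whatever monitoring you already run turns "reboot once every 3 years" into an alert instead of a calendar reminder.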

If you think operators of these servers can’t be bothered to update and reboot their machines once in 3 years or change a single BIOS setting, what makes you think they’d be interested in tearing down their servers, physically replacing the CPU, and reassembling all of them with the associated downtime and inevitable accidental damage to some units? Nothing about that makes sense from a business perspective.

I’m picturing a long 50’ aisle filled with racks and a guy with a huge box marked “replacement CPUs” and a screwdriver.

Good lord, can you imagine how long just a few of those would take in a data center?

I remember coming to work one morning and having staff at two tables with boxes of RSA keys, swapping everyone's...

(they replaced 40 million of those things..)

https://arstechnica.com/information-technology/2011/06/rsa-f...

The old every CPU is sacred idea lives on.
A key difference between then and now is how much easier it is to distribute software/firmware workarounds or fixes. From an end user's perspective, replacing the CPU might have been seen as far easier than updating their software. A software fix would affect performance, so of course it isn't as simple as that, but this difference is part of the dynamic.

Also, as a direct user of the CPU, if the fdiv bug would impact you it would affect you often rather than once every three years which is the impact frequency of this fault.

Another matter that affected the fdiv bug is that the Pentium line was the first time a CPU had been aggressively marketed directly at the general public in quite the way it was. Prior to that only manufacturers and techies would have known about it and they were used to errata for hardware components. The public more generally had an impression that hardware (at least undamaged hardware) was reliable and only software had bugs, and the fdiv bug invalidated that view of reality causing a bit of a panic.

These types of bugs have been in hardware forever. Nobody is going to replace hundreds of EPYC servers even if they could get a free replacement from AMD.

There are definitely cases where hardware should be exchanged with fixed chips, particularly the small business/consumer/hobbyist range where exchanging CPUs is worth the time and effort. The RDRAND problem with Ryzen chips was much worse because it actually happened all the time and there is still no microcode fix available for some motherboards (though AMD already makes the fix available so it's more of an issue about a lack of motherboard support than broken hardware).

> today we seem too willing to put up with being sold broken stuff.

i remember reading that when hard disks just came into the mass market they were so expensive that having some bad sectors was not such a big deal... and so a hard disk would usually come with a sheet of paper listing the known broken sectors (detected at QA stage, i guess).

maybe someone older than me (i guess somebody in their 50s or 60s) could confirm that.

I'm not that old, but I remember seeing bad sector lists as stickers on some hard disks.

I'm not sure if that ever went away, though... I think the IDE firmware in more modern hard disks knew how to redirect bad sectors to good sectors, so the end user never even noticed.

I'm way too young to remember it clearly but from what I was told it was nothing of the sort. Intel announced that they had identified a bug and would review on a case by case basis to see who was affected and would determine if you were worthy of getting a CPU that was fixed.

Again, this is secondhand but from people who worked directly in the industry at the time.

Don't divide, Intel inside!