(Oodle maintainer here.) This issue only occurs on a small fraction of machines, but on those that we've had access to, it reproduces with BIOS defaults and no user-specified overclocking. It turns out several of these mainboards will overclock and set other values out of spec even at BIOS defaults.
I don't have a problem with end users experiencing instability once they manually overclock (that's how it goes), but CPU + mainboard combinations experiencing typical OC symptoms with out-of-the-box settings is just not OK. This appears to be an arms race between mainboard vendors, all going further and further past spec by default because it gives better benchmark and review scores and their competition does it. Intel, for their part, are also dialing in their parts more aggressively over time (presumably with smaller margins, although I don't know for sure), and they are for sure aware that this is happening: a) even had they not known already (which they did), they would have learned about it months ago when we first contacted them about this issue, and b) technically out of spec or not, as long as it seems to work fine for users and makes their parts look better in reviews, they're not going to complain.
However, it turns out that it does not work fine for at least some small fraction of machines. I have no idea what that percentage is, but it's high enough that googling for, say, "Intel 13900K crash" yields plenty of relevant results. Some of these will be actual intentional overclockers but, given how boards default to some extent of out-of-spec overclocking, it's unlikely to be all of them.
Meanwhile we (and other SW vendors) are getting a noticeable uptick in crash reports on, specifically, recent K-series Intel CPUs, and it's not something we can sanely work around, because the issue manifests as code randomly misbehaving, and not even while doing anything fancy.
The Oodle issue in particular is during LZ77-family decompression, which is to say, all integer arithmetic (not even multiplies, just adds, shifts and logic ops), loads/stores and branches. This is the bare essentials. If it were an issue with, say, AVX2, we could avoid AVX2 code paths on that family of machines (and preferably figure out what exactly is going wrong so we can come up with a more targeted workaround). But there is no sane plan B for "integer ALU ops, loads/stores and branches don't work reliably under load". If we can't rely on that working, there is not enough left for us to work around bugs with!
I realize this all looks like finger-pointing, but this is truly beyond our capacity to work around in a sane way in SW, with what we know so far anyway. Maybe there is a much more specific trigger involved that we could avoid, but if so, we haven't found it yet. Either way, when it's easy to find end user machines that are crashing at stock settings, things have gone too far and Intel needs to sit down with their HW partners and get everyone (themselves included) to de-escalate.
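For readers unfamiliar with what "bare essentials" means here, the following is a toy sketch of an LZ77-style decode loop. The token layout (a control byte whose high bit selects literal run vs. match, a 3-bit length and a 12-bit offset) and the name `lz77_decode` are invented for illustration; this is not Oodle's format or code, and real decoders are far more optimized. The point is only that every step is a load, an add, a shift, a mask, a compare or a branch.

```python
def lz77_decode(src: bytes) -> bytes:
    """Decode a toy LZ77 stream. Hypothetical format, NOT Oodle's.

    Assumes well-formed input; truncated streams are not handled.
    """
    out = bytearray()
    i = 0
    while i < len(src):                      # compare + branch
        ctrl = src[i]                        # load
        i += 1                               # add
        if (ctrl & 0x80) == 0:               # logic op + branch
            # Literal run: copy the next (ctrl & 0x7F) + 1 bytes verbatim.
            run = (ctrl & 0x7F) + 1
            out += src[i:i + run]            # loads + stores
            i += run
        else:
            # Match: copy `length` bytes starting `offset` bytes back in
            # the output that has already been produced.
            length = ((ctrl >> 4) & 0x07) + 3              # shift + mask + add
            offset = (((ctrl & 0x0F) << 8) | src[i]) + 1   # mask + shift + or
            i += 1
            if offset > len(out):
                raise ValueError("match offset points before start of output")
            start = len(out) - offset
            # Byte-by-byte copy so that overlapping matches (offset < length)
            # repeat the bytes written earlier in this same match.
            for k in range(length):
                out.append(out[start + k])
    return bytes(out)


# Literal "abc", then a match of length 4 at offset 3 (overlapping copy).
stream = bytes([0x02]) + b"abc" + bytes([0x90, 0x02])
print(lz77_decode(stream))
```

A single wrong result anywhere in a loop like this (a bad offset, a mispredicted branch that is not rolled back correctly, a corrupted load) produces garbage output or an out-of-bounds access, and there is no narrower instruction subset to fall back to.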
Out of the box with default settings, it was pushing 320W through the CPU in stress tests.
I use my machine for FPGA compiles, so I need reliability. I learned that ASUS Multicore Enhancement is not the only thing that must be disabled; you must also manually enter the power limits.
Now my compiles take exactly the same length of time but use at least 100W less power.
I am glad to learn from your field data that I've inadvertently sidestepped a potentially catastrophic bug. I don't want to release an FPGA bitstream with flipped bits to users. And the FPGA tools already crash often enough on their own.