|
|
|
|
|
by ajross
847 days ago
|
|
Ferrari for sure warrants their cars as sold. But if you take it to a mod shop and put in an aftermarket turbo that damages your valves, you don't go whining to HN with an article with "Ferrari Engine Instability" in the title, do you? I don't know what you want Intel to do here. They tell you upfront what the power and clock limits are on the parts. But the market has a three decade history of people pushing the chips a little past their limit for fun and profit, so they "allow" it even if they know it won't work for everything. |
|
I don't have a problem with end users experiencing instability once they manually overclock (that's how it goes), but CPU + mainboard combinations experiencing typical OC symptoms with out-of-the-box settings is just not OK.
This appears to be an arms race between mainboard vendors all going further and further past spec by default because it gives better benchmark and review scores and their competition does it. Intel for their part are themselves also dialing in their parts more aggressively (and, presumably although I don't know for sure, with smaller margins) over time, and they are for sure aware that this is happening, because a) even had they not known already (which they did) they would have learned about this months ago when we first contacted them about this issue, b) technically out of spec or not, as long as it seems to work fine for users and makes their parts look better in reviews, they're not going to complain.
However, it turns out, it does not work fine for at least some small fraction of machines. I have no idea what that percentage is, but it's high enough that googling for say "Intel 13900K crash" yields plenty of relevant results. Some of this will be actual intentional overclockers but, given how boards default to some extend of out-of-spec overclocking enabled, it's unlikely to be all of them.
Meanwhile we (and other SW vendors) are getting a noticeable uptick in crash reports on, specifically, recent K-series Intel CPUs, and it's not something we can sanely work around because the issue manifests as code randomly misbehaving and it's not even when doing anything fancy. The Oodle issue in particular is during LZ77-family decompression, which is to say, all integer arithmetic (not even multiplies, just adds, shifts and logic ops), loads/stores and branches. This is the bare essentials. If it was an issue with say AVX2, we could avoid AVX2 code paths on that family of machines (and preferably figure out what exactly is going wrong so we can come up with a more targeted workaround). But there is no sane plan B for "integer ALU ops, load/stores and branches don't work reliably under load". If we can't rely on that working, there is not enough left for us to work around bugs with!
I realize this all looks like finger-pointing, but this is truly beyond our capacity to work around in a sane way in SW, with what we know so far anyway. Maybe there is a much more specific trigger involved that we could avoid, but if so, we haven't found it yet.
Either way, when it's easy to find end user machines that are crashing at stock settings, things have gone too far and Intel needs to sit down with their HW partners and get everyone (themselves included) to de-escalate.