by kkielhofner 4877 days ago
Kristian Kielhofner here - While I understand your analogy I don't think it's an accurate one. In fact, with the release of the successor to the 82574 Intel has already implemented some of the things I suggested:

http://communities.intel.com/community/wired/blog/2012/10/18...

Clearly they have learned from the various EEPROM issues on previous controllers (including the 82574) and implemented (among other things) EEPROM signing, which addresses some (most?) of my concerns about sane hardware behavior. Software drivers already do some basic EEPROM checks on this hardware (I know because I've had to tweak them); I'm simply suggesting these checks go a little further to verify the various EEPROM settings that could potentially result in a scenario like this one. When the effects are as significant as they are here, I hope we can all agree: more sanity checking is a good thing.
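
As a rough illustration of the kind of check being discussed (a sketch only, not Intel's driver code): to my understanding, the e1000-family drivers treat an EEPROM image as valid when its first 64 16-bit words sum to 0xBABA. The `settings_sane` helper and its rule table are hypothetical, added only to show what "going a little further" could mean; the word offset and mask used in the usage lines are made up.

```python
NVM_SUM = 0xBABA          # expected 16-bit sum (assumed e1000-family convention)
NVM_CHECKSUM_WORD = 0x3F  # last word covered by the checksum

def nvm_checksum_ok(words):
    """Return True if words 0x00..0x3F sum to NVM_SUM modulo 2**16."""
    if len(words) <= NVM_CHECKSUM_WORD:
        return False
    total = sum(words[:NVM_CHECKSUM_WORD + 1]) & 0xFFFF
    return total == NVM_SUM

def fix_checksum(words):
    """Return a copy of the image with the checksum word recomputed."""
    out = list(words)
    partial = sum(out[:NVM_CHECKSUM_WORD]) & 0xFFFF
    out[NVM_CHECKSUM_WORD] = (NVM_SUM - partial) & 0xFFFF
    return out

def settings_sane(words, rules):
    """Hypothetical extra check: rules maps word offset -> (mask, expected)."""
    return all((words[off] & mask) == expected
               for off, (mask, expected) in rules.items())

# Usage: an image can pass the checksum and still carry a bad setting.
image = fix_checksum([0] * 64)
rules = {0x0F: (0x0001, 0x0000)}  # made-up rule: bit 0 of word 0x0F must be clear
bad = list(image)
bad[0x0F] |= 0x0001               # flip the made-up setting bit
bad = fix_checksum(bad)           # checksum is valid again
```

Here `bad` still passes `nvm_checksum_ok` but fails `settings_sane`, which is the gap a plain checksum leaves open.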

2 comments

Let me preface this by saying that your epic trouble-shooting effort was really cool. That's what inspires me to pay so much attention to this.

If you'll excuse my ignorance, could you identify which points made on the linked page correspond to your suggestions? I can see how signature checking, if there is in fact such a mechanism on the controller, can help ensure that an EEPROM image is a member of a particular favored set of such images, but you'll admit that that's a less general approach than "in-hardware sane behavior". I don't know anything about µC design, but it would surprise me if the mistake here were as simple as setting a "die when you see this particular byte sequence" bit. It seems more likely that the behavior is an emergent property based on a combination of flags and coded behavior. I still don't think it would be possible for the controller to prevent that result in general. It is possible to test for bad behavior, as your customers proved. It's also possible for drivers to correctly handle the bad behavior of their hardware, and I'm sure appropriate patches are welcome.

Did your board vendor inform you of Intel's findings back in October? If so, could your original article have been a bit more explicit about the fact that Intel wasn't responsible for this? If not, are you looking for another board vendor?

Thanks!

Let me start by saying that I'm not asking for or expecting perfect hardware or software. This does not exist. I'm looking for improvements. Sane? Let's start with "sane-er". I linked to the i210 because it offers exactly what I'm asking for: improvement (as you'd expect in 4+ years of development).

The link for the i210 was an overview for general consumption. The 862-page datasheet is here:

http://www.intel.com/content/dam/www/public/us/en/documents/...

The description of the various memory and configuration spaces starts around page 53. When compared to what's available in the 82574L this is clearly a substantial improvement.

However, as I say in my update, we still don't /really/ know why this issue manifested the way it did. Without knowing the true underlying cause, anything I offer is speculation, as are your suppositions. So it is unknown whether the improvements in the i210 would have eliminated or even ameliorated this issue.

As for catching this exception in driver software? Possible, but doubtful. When I worked with Intel last fall, they seemed to dismiss this possibility. Current drivers report a loss of communication with the PHY, and the adapter seems to essentially disappear from the PCI bus until a full power cycle.

Neither Intel nor my board vendor reported these findings to me until this story broke last week. I reported this issue to them last fall: both of them claimed to have never seen this issue before (or since).

Meanwhile, as I’ve said before, other people have consistently reproduced this issue with different board manufacturers. We are pursuing a second source but I'm not going to be any more confident with the second source if it has 82574L controllers. I can't be certain it's going to be any different.

Thanks so much for the detailed response, and good luck in your hunt for better vendors. It seems that it's going to fall to you to test and correct the EEPROM settings. You might want to keep your results to yourself in future; you could probably get some big-money consulting work with other companies forced to use these products. It's so shitty that neither party bothered to respond until you went public with this.

Hey Kris. Welcome to HN. I remember you from Astricons of years past and used astlinux on alix a lot over the years. Good stuff.

I didn't realize you guys were so close. I live just off Televast, about a mile and a half from Star2Star.