by noonespecial 4877 days ago
Kielhofner's response.

http://blog.krisk.org/2013/02/packets-of-death-update.html

I used to use "Lanner" gear for VoIP, and these had embedded Intel Ethernet controllers. I don't have any more of them to test, but I swear I've seen it on them as well. We suspected power supply problems because the link lights would just go dark every once in a blue moon and need a power cycle to set right, but we were never able to reproduce it.

2 comments

I am impressed by his original troubleshooting, but this followup seems impractical. Of his three suggestions, only the third (Intel providing improved board testing tools) even seems like it could possibly prevent this sort of problem. Asking for hardware-enforced "sane" behavior is like asking, "why doesn't my computer know I don't want my program to deadlock, segfault, or loop indefinitely?" That is, if the controller could do that then it would solve the Halting Problem. Improved drivers, his second suggestion, are always a good thing, but drivers only get patched to handle broken hardware in response to the discovery of broken hardware. There is no way to anticipate each particular way a NIC could possibly be broken ahead of time.

The market demands controllers with flexible and expandable functionality. Board manufacturers use the EEPROM to specify exactly what behavior is required. If a particular manufacturer underestimates the importance of correctness and doesn't perform the code review and testing necessary to prevent a PoD, that isn't Intel's fault.

Kristian Kielhofner here - While I understand your analogy I don't think it's an accurate one. In fact, with the release of the successor to the 82574 Intel has already implemented some of the things I suggested:

http://communities.intel.com/community/wired/blog/2012/10/18...

Clearly they have learned from the various EEPROM issues on previous controllers (including the 82574) and implemented (among other things) EEPROM signing, which addresses some (most?) of my concerns about sane hardware behavior. Software drivers already do some basic EEPROM checks on this hardware (I know because I've had to tweak them); I'm simply suggesting these checks go a little further to verify the various EEPROM settings that could potentially result in a scenario like this one. When the effects are as significant as they are here I hope we can all agree: more sanity checking is a good thing.
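
To illustrate the kind of basic EEPROM check a driver can do: as far as I know, Intel's e1000-family Linux drivers sum the first 64 16-bit words of the NVM image and expect the result to be 0xBABA. The Python sketch below mirrors that idea only. The function names and the sample image are made up for illustration, and a checksum like this says nothing about whether individual settings are sane, which is the further checking being suggested here.

```python
NVM_WORDS = 0x40   # words 0x00..0x3F are covered by the checksum
NVM_SUM = 0xBABA   # expected 16-bit sum (assumption: value used by Intel's e1000-family drivers)

def nvm_checksum_ok(words):
    """Return True if the first 64 16-bit words sum to NVM_SUM modulo 2**16."""
    if len(words) < NVM_WORDS:
        return False
    return sum(words[:NVM_WORDS]) & 0xFFFF == NVM_SUM

def fix_checksum(words):
    """Return a copy with the last covered word (0x3F) set so the sum matches."""
    fixed = list(words)
    partial = sum(fixed[:NVM_WORDS - 1]) & 0xFFFF
    fixed[NVM_WORDS - 1] = (NVM_SUM - partial) & 0xFFFF
    return fixed

# Hypothetical image: arbitrary words, not a real 82574L EEPROM dump.
image = fix_checksum([0x1234] * NVM_WORDS)
```

A driver-side sanity check would go beyond this by also rejecting images whose individual configuration words hold values known to cause trouble.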

Let me preface this by saying that your epic trouble-shooting effort was really cool. That's what inspires me to pay so much attention to this.

If you'll excuse my ignorance, could you identify which points made on the linked page correspond to your suggestions? I can see how signature checking, if there is in fact such a mechanism on the controller, can help ensure that an EEPROM image is a member of a particular favored set of such images, but you'll admit that that's a less general approach than "in-hardware sane behavior". I don't know anything about µC design, but it would surprise me if the mistake here were as simple as setting a "die when you see this particular byte sequence" bit. It seems more likely that the behavior is an emergent property based on a combination of flags and coded behavior. I still don't think it would be possible for the controller to prevent that result in general. It is possible to test for bad behavior, as your customers proved. It's also possible for drivers to correctly handle the bad behavior of their hardware, and I'm sure appropriate patches are welcome.

Did your board vendor inform you of Intel's findings back in October? If so, could your original article have been a bit more explicit about the fact that Intel wasn't responsible for this? If not, are you looking for another board vendor?

Thanks!

Let me start by saying that I'm not asking for or expecting perfect hardware or software. This does not exist. I'm looking for improvements. Sane? Let's start with "sane-er". I linked to the i210 because it offers exactly what I'm asking for: improvement (as you'd expect in 4+ years of development).

The link for the i210 was an overview for general consumption. The 862-page datasheet is here:

http://www.intel.com/content/dam/www/public/us/en/documents/...

The description of the various memory and configuration spaces starts around page 53. When compared to what's available in the 82574L this is clearly a substantial improvement.

However, as I say in my update, we still don't /really/ know why this issue manifested the way it did. Without knowing the true underlying cause anything I offer is speculation, as are your suppositions. So we don't know whether the improvements in the i210 would have eliminated or even ameliorated this issue.

As far as catching this exception in driver software? Possible, but doubtful. When I worked with Intel last fall they seemed to dismiss this possibility. Current drivers report a loss of communication with the PHY and the adapter seems to essentially disappear from the PCI bus until a full power cycle.

Neither Intel nor my board vendor reported these findings to me until this story broke last week. I reported this issue to them last fall: both of them claimed to have never seen this issue before (or since).

Meanwhile, as I’ve said before, other people have consistently reproduced this issue with different board manufacturers. We are pursuing a second source but I'm not going to be any more confident with the second source if it has 82574L controllers. I can't be certain it's going to be any different.

Thanks so much for the detailed response, and good luck in your hunt for better vendors. It seems that it's going to fall to you to test and correct the EEPROM settings. You might want to keep your results to yourself in future; you could probably get some big-money consulting work with other companies forced to use these products. It's so shitty that neither party bothered to respond until you went public with this.

Hey Kris. Welcome to HN. I remember you from Astricons of years past and used astlinux on alix a lot over the years. Good stuff.

I didn't realize you guys were so close. I live just off Tallevast, about a mile and a half from Star2Star.

> "why doesn't my computer know I don't want my program to deadlock, segfault, or loop indefinitely?"

For a device which can be checked from outside (second subsystem for device self-monitoring), this is actually possible to implement and fairly common. Watchdogs are often implemented to restart automatically when the device is completely unresponsive.
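
A minimal sketch of the watchdog idea, in Python for readability (real watchdogs are usually a hardware timer or a separate supervisor chip). The `Watchdog` class, its `kick`/`poll` methods, and the `reset` callback are hypothetical names, not any vendor's API. The monitored device must "kick" the watchdog periodically; if it stays silent longer than the timeout, the outside monitor triggers a reset.

```python
import time

class Watchdog:
    """Fire `reset` if the monitored device has not kicked within `timeout_s`."""

    def __init__(self, timeout_s, reset, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.reset = reset          # callback that restarts the device
        self.clock = clock          # injectable so the logic can be tested
        self.last_kick = clock()

    def kick(self):
        """Called by the healthy device to prove it is still responsive."""
        self.last_kick = self.clock()

    def poll(self):
        """Called by the outside monitor; returns True if a reset was triggered."""
        if self.clock() - self.last_kick > self.timeout_s:
            self.reset()
            self.last_kick = self.clock()  # give the device time to come back
            return True
        return False
```

Note that this only detects "completely unresponsive", as the comment says. It cannot tell a device that is alive but misbehaving from a healthy one.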

I could imagine a board manufacturer doing something like this for an expensive, low-volume NIC. I still don't see how Intel could help.

> That is, if the controller could do that then it would solve the Halting Problem

The halting problem is actually decidable for limited-memory machines, though you need O(2^n) memory beyond the n-memory of the machine to actually decide it.
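
A small sketch of why this works, assuming a deterministic machine whose entire configuration fits in a hashable value. The `halts` function and the two toy 8-bit machines are made up for illustration. The checker records every configuration it has seen: if one repeats, the machine is in a loop and will never halt. For a machine with n bits of state the `seen` set can grow to 2^n entries, which is where the exponential memory cost comes from.

```python
def halts(step, start):
    """Decide halting for a deterministic finite-state machine.

    `step(state)` returns the next state, or None once the machine halts.
    """
    seen = set()
    state = start
    while state is not None:
        if state in seen:       # same configuration again: it loops forever
            return False
        seen.add(state)
        state = step(state)
    return True

# Toy 8-bit machines (hypothetical): both halt only when the register hits 0.
def countdown(s):
    return None if s == 0 else (s - 1) & 0xFF

def add_two(s):
    return None if s == 0 else (s + 2) & 0xFF

print(halts(countdown, 200))  # counts down to 0 and halts
print(halts(add_two, 1))      # odd values never reach 0, so it cycles
```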

We use Lanner gear for VoIP, and have never seen a problem.

I think we only saw it on FW-7550's at the beginning of the production run (the ones with the "snout" fans on the CPU and no case fan).