by bcantrill 3732 days ago
Yes, and the "big and gnarly" issues that he alludes to are in fact a driver issue that has been seen only by him and brought up exactly once by him on the mailing list -- and that was a year and a half ago.[1] There was lots of discussion at the time, the conclusions being (1) that he was advocating changes that were deemed unsafe and (2) that his most serious problems were seen on hardware known to be bad. The driver that he's referring to (ixgbe) is in very widespread production on illumos (albeit likely more frequently over optics than the copper that he has deployed); to the degree that there's a driver issue here at all (and that's not a foregone conclusion!), it seems likely that there is something specific to his environment that is inducing it. Certainly, with no one else seeing the issue and without better information from him (e.g., a kernel crash dump that indisputably shows an ixgbe-level issue), it's hard to see how anyone could expect any real progress to be made on this issue -- illumos or otherwise, open source or otherwise.

tl;dr: This in no way represents the "limits of open source" -- but it does highlight the limits of relying on other people to magically solve your problems for you.

[1] https://www.listbox.com/member/archive/182179/2014/10/search...


Are you sure your summary of that thread is really accurate?

To me it reads as Chris providing an exceptionally detailed bug report (including the exact code paths triggering the problem, and statistics from lockstat and dtrace on the lock in question). Nobody in the thread asks for more information (why would you want a "crash dump" for a non-crashing bug anyway?). Everyone seems to agree that the drivers are in fact taking spinlocks for long periods of time, while holding other locks. Nobody talks about "hardware known to be bad". What is talked about is how it's been too long since the drivers were last synced with upstream.

It is detailed, but in all the wrong ways: instead of describing the problem that he's seeing and offering data, he has jumped to a code path that he believes is inducing it -- without much in the way of supporting evidence. And yes, he talks about bad HW ("access to the second port on the card currently fails to acquire swfw sync"). The ensuing discussion is more of a desultory wandering than it is a deliberate investigation into his problem -- which isn't surprising, because he hasn't described a problem but merely an observed artifact in the system. (Long lock hold times can easily be misleading; when exploring latency bubbles, one needs to be very careful about tying observed behavior to the latency outliers, lest one discover problems without discovering "the" problem.)

So yes, I stand by my summary of the thread.

I'm using ixgbe (with copper) under SmartOS on three machines right now and have had zero issues.

So he's not willing to generate a kernel crash dump? Is that what you are saying? Not an attack, genuinely curious.

I'm saying that even if someone wanted to debug the problem out of the goodness of their heart, they lack the necessary data to debug it. A kernel crash dump may or may not be required; the discussion jumped too quickly to his (unverified) hypothesis about the root of the problem for anyone to even know what data would be required.