Hello there. Thought I'd follow up as we have some additional news from our vendor.
The fact that this "should not have happened" doesn't help our customers out, and doesn't make us feel particularly better about the incident we had on Friday.
Just one of our six clusters was affected, but it was a bad day for customers on that cluster, and for that I personally apologize to them.
We're working hard to make sure that this issue is entirely resolved on this, and our other clusters as well, providing Extreme with the core dumps they've requested.
The above warning messages are caused by “System memory depletion” and “watchdog timer” triggered “restarting slot 3 and 4. After crashed, slot 3 and 4 created “core dump file”. We need that files for detail investigation.
At 12:15 PDT today, we experienced a failure of our switching infrastructure on cluster EY02. We are incredibly sorry about this issue and wish to convey all of the information we presently possess.
The result of this failure was blocked IO from some slices to the SAN shelves. Based on the logs and diagnostic output available to engineers, it seems likely a software issue with one of the modules in the switches is at fault. We have captured the necessary log entries from the switches, and we are working with the vendor to identify the cause of the errors. (In case you are curious - our switching infrastructure utilizes 4 stacked switches, all interconnected for full redundancy.)
The fact that this "should not have happened" doesn't help our customers out, and doesn't make us feel particularly better about the incident we had on Friday.
Just one of our six clusters was affected, but it was a bad day for customers on that cluster, and for that I personally apologize to them.
We're working hard to make sure that this issue is entirely resolved on this, and our other clusters as well, providing Extreme with the core dumps they've requested.
Details below:
----------------------------------------------------------------------------------------------------------------------------
08/16/2008 03:53:58.16 <Noti:EPM.wd_warm_reset> Slot-1: Changing to watchdog warm reset mode
08/16/2008 03:09:42.05 <Warn:DM.Warning> Slot-2: Slot-3 FAILED (2) Error on Slot-3
08/16/2008 03:09:38.62 <Warn:DM.Warning> Slot-2: Slot-3 FAILED (1) Error on Slot-3
08/16/2008 03:09:29.23 <Warn:DM.Warning> Slot-1: Slot-3 FAILED (2) Conduit receive error encountered
08/16/2008 03:09:25.17 <Warn:DM.Warning> Slot-1: Slot-3 FAILED (1) Conduit receive error encountered
08/16/2008 03:09:25.17 <Warn:DM.Warning> Slot-1: System Error 0: Conduit receive error encountered
08/16/2008 03:08:22.29 <Warn:DM.Warning> Slot-1: Slot-4 FAILED (1)
08/16/2008 03:08:02.94 <Warn:DM.Warning> Slot-1: Slot-4 FAILED (1)
----------------------------------------------------------------------------------------------------------------------------
The above warning messages are caused by “System memory depletion” and “watchdog timer” triggered “restarting slot 3 and 4. After crashed, slot 3 and 4 created “core dump file”. We need that files for detail investigation.