> As to what can be done to prevent similar failures, the FCC is recommending CenturyLink and other backbone providers take some basic steps, such as disabling unused features on network equipment, installing and maintaining alarms that warn admins when memory or processor use is reaching its peak, and having backup procedures in the event networking gear becomes unreachable.
Disabling unused services? Alarms when nearing resource limits? Contingency plans? How is this the first time this has come up?! These are like security & devops 101.
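For what it's worth, the "alarm when nearing resource limits" part really is only a few lines. A minimal sketch, assuming utilisation samples are already being collected somehow; the `Sample` type, the metric names, and the threshold values are all made up for illustration and say nothing about how any vendor's gear does it:

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One utilisation reading from a device (hypothetical shape)."""
    cpu_pct: float
    mem_pct: float


# Hypothetical warning thresholds; sensible values depend on the device.
WARN_LIMITS = {"cpu_pct": 80.0, "mem_pct": 85.0}


def check_alarms(sample, limits=WARN_LIMITS):
    """Return one alarm string per metric at or above its warning limit."""
    alarms = []
    for metric, limit in limits.items():
        value = getattr(sample, metric)
        if value >= limit:
            alarms.append(f"{metric} at {value:.0f}% (limit {limit:.0f}%)")
    return alarms


if __name__ == "__main__":
    # A controller whose CPU is pegged but whose memory is fine.
    for alarm in check_alarms(Sample(cpu_pct=95.0, mem_pct=50.0)):
        print("WARN:", alarm)
```

The hard part in practice is getting the samples and the alerts out over a path that still works when the gear is melting down, which is the "backup procedures" item in the same list.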
It's kind of funny. These are best practices for running basic run of the mill web services, even something like a forum or personal homepage. Admittedly there's an obvious, massive difference in complexity, but you would expect the gold standard best practices to come from something mission critical like core Internet services and flow down to less critical services, not the other way around.
Well it is easy to find time to add gold-plating to a small basically useless service, but those guys are probably swamped or try to cut cost by being agile or something.
If one is cynical, it's just a way for the FCC to look like it is doing something. Or, if one attributes great, great rhetorical skill to the FCC, it's their way to lambaste CenturyLink for not even adhering to 101 level principles. I tend to believe the latter.
Network engineer here: CenturyLink bridged all of the management controllers on their Infinera DWDM shelves together into one multi-state-sized L2 broadcast domain. My best guess is that they did it because it made the shelves easier to poll over SNMP and to administer with other management tools.
Within the circle of people who really know what went on, we've been laughing at them for months.
The oddest part was that the FCC recommendations didn’t include “limit the size of your broadcast domain”.
Large flat L2 networks are a classic time bomb, with their builders proudly exclaiming “look ma, no hands!” until they painfully get reminded of their mistake :)
I worked for a regional ISP where almost the entire metro network was a single layer 2 network up to the points of presence. We regularly (as in at least once a week) had spanning tree loops and broadcast storms that took down the entire city.
In this particular case, I have some insider info, and I am also in possession of the RFO CenturyLink sent out for a number of downed 10GbE transport circuits. From the way they described it, a broadcast storm between Infinera node controllers propagated uncontrollably across their entire Infinera chassis fleet in the western US.
I'd bet it's due to a firmware/software bug triggered by a rare condition, or undefined software behavior triggered by a hardware malfunction. If that's true, the root cause will probably never be identified, because nobody can reproduce it. It's pretty scary to think about: we can never guarantee that most software will work correctly all the time, empirical testing is often the only practical assessment, and probabilistic bugs such as mysterious crashes can go undiscovered.
But I think the bigger problem is not the packets, but why the backbone didn't reject those malformed packets.
> 3. no expiration time, meaning that the packet would not be dropped for being created too long ago; and
They mean the TTL was set to zero.
From RFC 1812:
> A router MUST NOT originate or forward a datagram with a Time-to-Live (TTL) value of zero.
So a packet with TTL=0 should never be on the wire. (For example, if a router receives a packet with TTL=1 that is not destined for that specific router, the packet gets discarded.) My guess is the switching vendor had bad code that didn't handle TTL=0.
I agree that MPLS would be used for transport through the Infineras, but the article specifically states that this was caused by management traffic.
MPLS doesn't have a concept of a broadcast address and wouldn't have been used for management traffic (except maybe during transit). MPLS is really just used to get IP packets to their destination with less L3 overhead. Full disclosure: I work in the DC space, not the provider space, so I'm far from an expert on MPLS.
Ethernet famously doesn't have a TTL, so maybe this was just a typical Ethernet broadcast storm. In that case I don't know why TTL would've even been brought up.
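To make the "no TTL" point concrete, here is a toy flooding model. It is a sketch under loud assumptions: a hypothetical four-switch full mesh, no spanning tree, no MAC learning, and every switch simply re-floods a broadcast out of every port except the one it arrived on. The `hop_limit` parameter is an imaginary hop counter that plain Ethernet does not have; it is only there for comparison. None of this reflects the actual Infinera topology.

```python
def flood(links, source, steps, hop_limit=None):
    """Count broadcast copies on the wire at each step.

    links: dict mapping switch name -> list of neighbour switch names.
    hop_limit: None models plain Ethernet (no TTL field); an integer
    models a hypothetical hop counter decremented at every switch.
    """
    # Each in-flight copy is (sender, receiver, remaining_hops).
    in_flight = [(source, nbr, hop_limit) for nbr in links[source]]
    counts = [len(in_flight)]
    for _ in range(steps):
        next_flight = []
        for sender, receiver, hops in in_flight:
            if hops is not None:
                hops -= 1
                if hops <= 0:
                    continue  # copy expires here instead of being re-flooded
            # Flood out of every port except the one the frame arrived on.
            for nbr in links[receiver]:
                if nbr != sender:
                    next_flight.append((receiver, nbr, hops))
        in_flight = next_flight
        counts.append(len(in_flight))
    return counts


# Hypothetical looped topology: four switches, all bridged to each other.
MESH = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C"],
}

if __name__ == "__main__":
    print(flood(MESH, "A", steps=4))               # [3, 6, 12, 24, 48]
    print(flood(MESH, "A", steps=4, hop_limit=2))  # [3, 6, 0, 0, 0]
```

Without a hop counter the number of copies doubles every step and never drains, from a single original frame. With one, the storm dies out by itself, which is why nothing at L2 saves you once the loop exists.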
They keep throwing around the word packet, which implies layer 3. Of course lots of people say packet when they mean frame.
Edit: There is a comment above saying they have an RFO stating this was a broadcast storm. So it was probably Ethernet and CenturyLink brought up TTL as a way to blame the protocol.
Usually the lowest TTL on the wire is 1: the next router subtracts 1, the value becomes zero, and the packet is dropped on that same router (with an ICMP message sent back).
If someone didn't put in an additional if() to check, this could cause many problems, especially with broadcasts. And why would they check, when no device normally sends out packets like this (unless some other device also skipped the if() check, or someone sent those packets on purpose)?
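A rough sketch of the if() in question, for IP-style forwarding. The function names and the `Action` enum are invented for illustration. The unchecked variant shows one hypothetical way a missing check could go wrong (an unsigned 8-bit decrement wrapping around); it is not a claim about what the actual firmware did. The sketch also just drops a packet that arrives with TTL 0, whereas a real router may send ICMP for it as well.

```python
from enum import Enum


class Action(Enum):
    FORWARD = "forward"
    DROP = "drop"
    DROP_AND_ICMP = "drop + ICMP Time Exceeded"


def handle_transit(ttl):
    """Decide what a router does with a packet not addressed to itself.

    Returns (action, outgoing_ttl); outgoing_ttl is None when dropped.
    """
    if ttl <= 0:
        # The extra check: TTL 0 should never arrive, so refuse to forward it.
        return Action.DROP, None
    ttl -= 1
    if ttl == 0:
        # Normal expiry: arrived with TTL 1, dies on this router.
        return Action.DROP_AND_ICMP, None
    return Action.FORWARD, ttl


def handle_transit_unchecked(ttl):
    """Hypothetical buggy variant: decrement an 8-bit TTL with no check."""
    ttl = (ttl - 1) % 256  # unsigned 8-bit wraparound: 0 becomes 255
    if ttl == 0:
        return Action.DROP_AND_ICMP, None
    return Action.FORWARD, ttl


if __name__ == "__main__":
    for ttl in (64, 1, 0):
        print(ttl, handle_transit(ttl), handle_transit_unchecked(ttl))
```

With the check, TTL 0 is dropped on arrival. Without it, the same packet leaves with TTL 255 and gets a fresh lease on life at the next hop.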
> In the Bureau’s discussions with Infinera, Infinera used the term “packet” to describe what some experts refer to as Ethernet frames that are sent between nodes. For the sake of simplicity, this report uses the term “packet.”
The correct title is: how a misconfigured CenturyLink network broke when a rotten packet arrived.
The current title sounds like it was a packet failure, which it was not. It was only a matter of time until this problem occurred; hardware must be resilient to malformed input.