by kj4ips 2168 days ago
> # Management practices

>

> Cloudflare's core business is networking. It actually embarrasses me to see that Cloudflare YOLO'd a BGP change in a Juniper terminal without peer reviews and/or without a proper administration dashboard, exposing safe(guarded) operations, a simulation engine and co.? In particular, re-routing traffic / bypassing POPs must be a frequent task at scale, how can that not be automated so to avoid human mistakes?

We don't know if this was entirely the case. Based on the timeline of the initial incident that prompted the change gone awry, an ITIL-style CR could very well have been created and processed within that time.

Judging by the edits made, this wasn't simply taking a POP out of service entirely, but reducing (or eliminating) the traffic that neighboring POPs send to compute at the ATL location. I can't imagine that this exact type of change is all that common. BGP anycast actually makes things significantly more complicated when removing edges.

As far as the mechanics go, with the Junos CLI there's not a lot of difference between what the intended command would have been and the one that actually happened.

---

What they probably wanted

| example@MX1> configure

|

| {master}[edit]

| example@MX1# edit policy-options policy-statement 6-BBONE-OUT

|

| {master}[edit policy-options policy-statement 6-BBONE-OUT]

| example@MX1# deactivate term 6-SITE-LOCAL

|

| {master}[edit policy-options policy-statement 6-BBONE-OUT]

| example@MX1# commit

---

What might have happened

| example@MX1> configure

|

| {master}[edit]

| example@MX1# edit policy-options policy-statement 6-BBONE-OUT

|

| {master}[edit policy-options policy-statement 6-BBONE-OUT]

| example@MX1# deactivate term 6-SITE-LOCAL from prefix-list 6-SITE-LOCAL

|

| {master}[edit policy-options policy-statement 6-BBONE-OUT]

| example@MX1# commit

---

Initially, this seems like quite a bit of difference. However, Junos has a hyperactive autocomplete that triggers on spaces, so that deactivate could have been as short as "dea ter 6 fr p 6".
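To make the abbreviation point concrete, here is a minimal sketch of unique-prefix expansion, the general idea behind this kind of completion. It is not Junos's actual parser: the keyword lists in `LEVELS` are hypothetical and deliberately tiny, and a real device offers many more keywords at each position, so the shortest unambiguous abbreviation there may well be longer.

```python
def expand(token, candidates):
    """Return the only candidate starting with token, or raise if 0 or 2+ match."""
    matches = [c for c in candidates if c.startswith(token)]
    if len(matches) != 1:
        raise ValueError(f"{token!r} is unknown or ambiguous: {matches}")
    return matches[0]


def expand_command(abbreviated, levels):
    """Expand each space-separated token against the keywords valid at its position."""
    tokens = abbreviated.split()
    return " ".join(expand(tok, cands) for tok, cands in zip(tokens, levels))


# Hypothetical keyword sets for each position of the command (assumption,
# not the real Junos grammar).
LEVELS = [
    ["activate", "deactivate", "delete", "set"],
    ["term", "then"],
    ["6-SITE-LOCAL"],
    ["from", "then"],
    ["prefix-list", "route-filter"],
    ["6-SITE-LOCAL"],
]

print(expand_command("dea ter 6", LEVELS))
print(expand_command("dea ter 6 fr p 6", LEVELS))
```

In this toy grammar the two inputs differ by nine keystrokes, yet one expands to a command that disables the whole term and the other to one that disables only its match condition.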

I'm not aware of any routing simulation product that can simulate complex BGP interactions and report on the effective routes of simulated traffic as well as CPU load predictions. The closest I am aware of is running GNS3 (or a bunch of VM routers) overnight and capturing SNMP.

On the other hand, automating these kinds of changes would seem trivial, but such a service would have to be as fault tolerant as any other project. It is most certainly a worthwhile endeavor, though, especially since integration is relatively easy: Junos provides some nice REST and XML APIs on the management interface that can do pretty much everything the CLI can, except start a shell.
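As a rough sketch of what an automated version of the intended change could generate, the snippet below builds an XML configuration fragment that marks the whole term as inactive. It assumes the Junos XML convention of flagging a deactivated statement with an `inactive="inactive"` attribute and an element layout that mirrors the CLI hierarchy; both should be checked against Juniper's documentation before use. The function name is hypothetical, and nothing here talks to a device.

```python
import xml.etree.ElementTree as ET


def deactivate_term_payload(policy, term):
    """Build a config fragment that deactivates one whole policy term.

    Assumption: a deactivated statement carries inactive="inactive",
    and the element nesting follows the CLI hierarchy.
    """
    config = ET.Element("configuration")
    policy_options = ET.SubElement(config, "policy-options")
    statement = ET.SubElement(policy_options, "policy-statement")
    ET.SubElement(statement, "name").text = policy
    # The attribute sits on the term itself, never on anything inside it.
    term_el = ET.SubElement(statement, "term", {"inactive": "inactive"})
    ET.SubElement(term_el, "name").text = term
    return ET.tostring(config, encoding="unicode")


print(deactivate_term_payload("6-BBONE-OUT", "6-SITE-LOCAL"))
```

Generating the payload from a function that only accepts a policy and a term name means the tool has no way to express "deactivate just the match condition", which is the kind of guardrail a hand-typed command lacks.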

3 comments

Thanks for the detail. There are a lot of people in here who are saying "why didn't they just test their changes before applying them?" and I don't think they really understand how hard that is and how rarely it's done.
Peer review should always be possible. Perhaps CF already does it and this slipped through the review; reviews only reduce errors, they don't eliminate them.

It is difficult to write automation that covers every task you would do. Even if you cover the ones most commonly done, you will have higher risk on the rest.

A linter, or a well-tested higher-level instruction set, may be a better solution. Automation, if any, should perhaps come after that.
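A linter of that kind could start very small. The sketch below is a hypothetical pre-commit check, not an existing tool: it looks at a single configuration-mode command and warns when a `deactivate` reaches inside a term's `from` clause instead of targeting the whole term. The rule and the function name are assumptions for illustration.

```python
def lint_deactivate(command):
    """Return a list of warnings for one configuration-mode command line."""
    words = command.split()
    warnings = []
    # Only "deactivate term <name> ..." is inspected in this sketch.
    if len(words) >= 3 and words[0] == "deactivate" and words[1] == "term":
        remainder = words[3:]  # whatever follows the term name
        if remainder and remainder[0] == "from":
            warnings.append(
                f"deactivates a match condition inside term {words[2]!r}; "
                "if no active conditions remain, the term may match every route"
            )
    return warnings


for cmd in (
    "deactivate term 6-SITE-LOCAL",
    "deactivate term 6-SITE-LOCAL from prefix-list 6-SITE-LOCAL",
):
    print(cmd, "->", lint_deactivate(cmd) or "ok")
```

A real linter would have to work on the candidate configuration diff rather than on typed commands, but even a rule this narrow would have flagged the second command.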

I think the mistake could be assuming that an empty "from" statement would not match any routes. In reality, deactivating everything inside a "from" statement removes it altogether and makes the term match all routes, which is indeed somewhat unexpected.
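A toy model of that behavior, under the assumption that a term matches when every active condition matches: with no active conditions left, the check is vacuously true. The prefixes are made-up documentation addresses, and the string-equality test is a stand-in for real prefix-list matching.

```python
SITE_LOCAL = {"2001:db8:1::/48"}  # hypothetical contents of the prefix list


def make_conditions(prefix_list_active):
    """One 'from prefix-list' condition, either active or deactivated."""
    return [{"active": prefix_list_active, "test": lambda r: r in SITE_LOCAL}]


def term_matches(route, conditions):
    """A term matches when all of its active conditions match the route."""
    active = [c for c in conditions if c["active"]]
    # all() over an empty list is True: no conditions means match everything.
    return all(c["test"](route) for c in active)


local, other = "2001:db8:1::/48", "2001:db8:ffff::/48"

before = make_conditions(prefix_list_active=True)
print(term_matches(local, before), term_matches(other, before))

after = make_conditions(prefix_list_active=False)
print(term_matches(local, after), term_matches(other, after))
```

With the condition active, only the site-local prefix matches; once it is deactivated, both routes match, so whatever the term's "then" action is now applies to everything that reaches it.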