|
|
|
|
|
by l1n
186 days ago
|
|
Also an engineer on this incident. This was a network routing misconfiguration - an overlapping route advertisement caused traffic to some of our inference backends to be blackholed. Detection took longer than we’d like (about 75 minutes from impact to identification), and some of our normal mitigation paths didn’t work as expected during the incident. The bad route has been removed and service is restored. We’re doing a full review internally with a focus on synthetic monitoring and better visibility into high-impact infrastructure changes to catch these faster in the future. |
|