by datadrivenangel
32 days ago
Their account was restored in 10 minutes from root cause, or 19 minutes from first detection! It just took 4-6 hours to get everything fully healthy. Hopefully Google responds to this; I look forward to seeing it.

May 19, 22:10 UTC - Our automated monitoring detected API health check failures and paged our on-calls, who started investigating the issue.
May 19, 22:11 UTC - Dashboard returning 503 errors. Users unable to log in.
May 19, 22:19 UTC - Root cause identified: Google Cloud Platform has suspended Railway's production account.
May 19, 22:22 UTC - P0 ticket filed with Google Cloud. Railway's GCP account manager engaged directly.
May 19, 22:29 UTC - Incident declared.
May 19, 22:29 UTC - GCP account access restored. All compute instances remained stopped and persistent disks inaccessible.

That's external dependency monitoring working exactly as it should. Most teams only monitor their own infrastructure. When a cloud provider, payment gateway, or third-party API fails, your own dashboards show green while users see failures.
The lesson isn't specific to GCP: monitoring what you depend on but don't control is just as important as monitoring what you own.
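
For anyone wondering what that looks like in practice, here is a minimal Python sketch of probing dependencies you don't control. It is not anything Railway describes using. The names `probe_dependencies`, `http_status`, and `ProbeResult` are made up for illustration, and treating any 2xx/3xx response as "healthy" is an assumption; real setups usually check a provider's status endpoint or run a synthetic transaction.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
import urllib.error
import urllib.request


@dataclass
class ProbeResult:
    name: str
    healthy: bool
    detail: str


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still useful.
        return err.code


def probe_dependencies(
    deps: Dict[str, str],
    fetch: Callable[[str], int] = http_status,
) -> List[ProbeResult]:
    """Probe each external dependency and report whether it looks healthy.

    Assumption: a 2xx or 3xx status counts as healthy. Connection errors,
    timeouts, and DNS failures are reported as unhealthy, not raised.
    """
    results = []
    for name, url in deps.items():
        try:
            status = fetch(url)
            healthy = 200 <= status < 400
            detail = f"HTTP {status}"
        except Exception as exc:
            healthy = False
            detail = f"{type(exc).__name__}: {exc}"
        results.append(ProbeResult(name, healthy, detail))
    return results
```

You would run this on a schedule from somewhere outside the provider being checked, with a map of placeholder names to whatever URLs your stack depends on, and page on any result where `healthy` is false. The `fetch` parameter is injectable so the logic can be tested without network access.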